Paper Title
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis
Paper Authors
Paper Abstract
Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models, yet they face the challenge of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can be trained with a novel bilateral modeling objective. We show that the new surrogate objective achieves a tighter lower bound on the log marginal likelihood than the conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPM, and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce samples of comparable or higher quality, indistinguishable from human speech, with as few as seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave). We release our code at https://github.com/tencent-ailab/bddm.
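To make the short-schedule sampling regime concrete, below is a minimal sketch (not the authors' implementation) of DDPM-style ancestral sampling over a handful of steps, which is the setting the abstract describes. The `score_net` stub, the `reverse_sample` helper, and the 7-step `betas` schedule are illustrative assumptions; in BDDM the noise schedule would instead be produced by the trained schedule network, and `score_net` would be a trained neural vocoder conditioned on acoustic features.

```python
# Minimal sketch of reverse-diffusion (ancestral) sampling over a short
# noise schedule. All names here are hypothetical placeholders.
import numpy as np

def score_net(x, noise_level):
    """Hypothetical stand-in for a trained score network.

    A trained network would predict the Gaussian noise injected into x;
    this stub returns zeros so the script runs without model weights.
    """
    return np.zeros_like(x)

def reverse_sample(betas, shape, rng):
    """Run DDPM-style ancestral sampling over the schedule `betas`."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = score_net(x, np.sqrt(alpha_bars[t]))
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # inject fresh noise at every step except the last
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

rng = np.random.default_rng(0)
# An assumed 7-step schedule, matching the step count quoted in the
# abstract; BDDM's schedule network would predict such a schedule.
betas = np.geomspace(1e-4, 0.5, num=7)
audio = reverse_sample(betas, shape=(16000,), rng=rng)
```

The key efficiency lever the abstract points to is that the loop above runs over only a few, well-placed noise levels rather than the hundreds of steps a DPM is typically trained with; searching for those levels is what the schedule network is trained to do.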