Paper Title


Parallel Synthesis for Autoregressive Speech Generation

Paper Authors

Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee

Abstract


Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at each time step conditioned on those at previous time steps. Although it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to low efficiency. Many works have been dedicated to generating the whole speech sequence in parallel, proposing GAN-based, flow-based, and score-based vocoders. This paper proposes a new approach to autoregressive generation. Instead of iteratively predicting samples along the time axis, the proposed model performs frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR) to synthesize speech. In FAR, a speech utterance is split into frequency subbands, and each subband is generated conditioned on the previously generated one. Similarly, in BAR, an 8-bit quantized signal is generated iteratively, starting from the first bit. By redesigning the autoregressive method to operate in domains other than the time domain, the number of iterations in the proposed model is no longer proportional to the utterance length but to the number of subbands/bits, significantly improving inference efficiency. In addition, a post-filter is employed to sample signals from the output posteriors; its training objective is designed based on the characteristics of the proposed methods. Experimental results show that the proposed model can synthesize speech faster than real time without GPU acceleration. Compared with baseline vocoders, the proposed model achieves better MUSHRA results and shows good generalization to unseen speakers and 44 kHz speech.
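The core idea of the abstract (iterating over subbands and bit planes instead of time steps) can be illustrated with a toy NumPy sketch. The FFT-masking filter bank, the `model` callable, and all function names below are illustrative assumptions for exposition only; they are not the paper's actual architecture or filter bank.

```python
import numpy as np

def split_subbands(signal, n_subbands):
    """Split a waveform into frequency subbands by masking FFT bins.
    (A simplified stand-in for a multi-band analysis filter bank;
    the masks partition the spectrum, so the subbands sum back to
    the original signal.)"""
    spec = np.fft.rfft(signal)
    edges = np.linspace(0, len(spec), n_subbands + 1, dtype=int)
    subbands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spec)
        masked[lo:hi] = spec[lo:hi]
        subbands.append(np.fft.irfft(masked, n=len(signal)))
    return subbands

def far_generate(model, conditioning, n_subbands):
    """FAR loop: each subband is generated conditioned on the previously
    generated one, so the number of model calls equals n_subbands and is
    independent of utterance length. `model` is a hypothetical callable
    (prev_subband, conditioning) -> subband waveform."""
    prev, subbands = None, []
    for _ in range(n_subbands):
        prev = model(prev, conditioning)
        subbands.append(prev)
    return np.sum(subbands, axis=0)  # recombine subbands into a waveform

def split_bits(quantized, n_bits=8):
    """BAR view: decompose 8-bit integer samples into bit planes, most
    significant bit first; generation proceeds one bit plane at a time."""
    return [(quantized >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]
```

With 4 subbands and 8 bit planes, the iteration counts (4 and 8) stay fixed no matter how long the utterance is, which is the efficiency argument made in the abstract.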
