Paper Title
Parallel Tacotron: Non-Autoregressive and Controllable TTS
Paper Authors
Paper Abstract
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called \emph{Parallel Tacotron}, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.