Revisiting Over-Smoothness in Text to Speech

Paper Title

Revisiting Over-Smoothness in Text to Speech

Authors

Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

Abstract

Non-autoregressive text-to-speech (NAR-TTS) models have attracted much attention from both academia and industry because of their fast generation speed. One limitation of NAR-TTS models is that they ignore correlations in the time and frequency domains when generating speech mel-spectrograms, which leads to blurry, over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of the data distribution and the capability of the modeling method. Both simplifying the data distribution and improving the modeling method can alleviate the problem. Accordingly, we first study methods that reduce the complexity of the data distribution. We then conduct a comprehensive study of NAR-TTS models that use advanced modeling methods. Based on these studies, we find that 1) methods that provide additional conditioning inputs reduce the complexity of the data distribution the model has to fit, thus alleviating the over-smoothing problem and achieving better voice quality. 2) Among the advanced modeling methods, Laplacian mixture loss performs well at modeling multimodal distributions and is simple, while GAN and Glow achieve the best voice quality at the cost of increased training or model complexity. 3) The two categories of methods can be combined to further alleviate over-smoothness and improve voice quality. 4) Our experiments on a multi-speaker dataset lead to similar conclusions, and show that providing more variance information reduces the difficulty of modeling the target data distribution and lowers the required model capacity.
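The link between multimodal targets and over-smoothing can be illustrated with a toy example. The sketch below is not the paper's implementation: the function name `laplace_mixture_nll`, the 1-D bimodal targets, and all parameter values are assumptions chosen for illustration. It shows that a squared-error objective is minimized by the mean of the two modes (a value that never occurs in the data), and that a two-component Laplace mixture assigns the same data a much lower negative log-likelihood than a single Laplace centered between the modes.

```python
import numpy as np


def laplace_mixture_nll(y, weights, locs, scales):
    """Mean negative log-likelihood of 1-D samples under a Laplace mixture.

    Illustrative helper (hypothetical name, not taken from the paper).
    """
    y = np.asarray(y, dtype=float)[:, None]            # shape (N, 1)
    weights = np.asarray(weights, dtype=float)         # shape (K,)
    locs = np.asarray(locs, dtype=float)               # shape (K,)
    scales = np.asarray(scales, dtype=float)           # shape (K,)
    # Per-component log density: log w_k - log(2 b_k) - |y - mu_k| / b_k
    log_comp = np.log(weights) - np.log(2.0 * scales) - np.abs(y - locs) / scales
    # Numerically stable log-sum-exp over the K components
    m = log_comp.max(axis=1, keepdims=True)
    log_prob = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(-log_prob.mean())


# Toy bimodal "target": half of the samples at -1, half at +1 (assumed data).
targets = np.array([-1.0] * 50 + [1.0] * 50)

# The constant prediction minimizing mean squared error is the sample mean,
# which lies between the two modes.
mse_optimal = float(targets.mean())

# A single Laplace centered between the modes versus a two-component mixture.
single_nll = laplace_mixture_nll(targets, [1.0], [0.0], [1.0])
mixture_nll = laplace_mixture_nll(targets, [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])

print("MSE-optimal constant prediction:", mse_optimal)
print("single Laplace NLL:", single_nll)
print("two-component Laplace mixture NLL:", mixture_nll)
```

In a TTS model the mixture weights, locations, and scales would be predicted per mel-spectrogram bin by the network rather than fixed by hand as they are here.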
