Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis

Paper Title

Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis

Authors

Wang, Yutian, Xie, Yuankun, Zhao, Kun, Wang, Hui, Zhang, Qin

Abstract

In this paper, we propose a novel prosody disentanglement method for prosodic Text-to-Speech (TTS) models. It introduces vector quantization (VQ) into the auxiliary prosody encoder to obtain decomposed prosody representations in an unsupervised manner. Thanks to this design, speaking styles such as pitch, speaking rate, and local pitch variance are decomposed automatically into quantized latent vectors. We also investigate the internal mechanism of the VQ disentanglement process by means of a latent variable counter and find that higher-value dimensions usually represent prosody information. Experiments show that our model can control the speaking style of the synthesized speech by directly manipulating the latent variables. Objective and subjective evaluations show that our model outperforms popular models.
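To make the vector quantization step mentioned in the abstract concrete, the following is a minimal sketch of generic nearest-codebook quantization in plain Python. It is not the authors' implementation: the function names (`quantize`, `count_code_usage`), the toy two-dimensional codebook, and the frame values are all hypothetical, and the usage counter is only a loose analogue of the latent variable counter the abstract refers to.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Replace each frame vector with its nearest codebook entry.

    Returns the chosen codebook indices and the quantized vectors.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]
    quantized = [codebook[i] for i in indices]
    return indices, quantized


def count_code_usage(indices, num_codes):
    """Count how often each codebook entry was selected."""
    counts = Counter(indices)
    return [counts.get(k, 0) for k in range(num_codes)]


# Hypothetical toy data: three codebook entries, four encoder output frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [1.1, 0.8]]

indices, quantized = quantize(frames, codebook)
usage = count_code_usage(indices, len(codebook))
```

In a trained model the codebook would be learned jointly with the encoder. Under that assumption, changing a selected index before decoding is one simple way to picture "directly manipulating the latent variables".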
