Paper Title
Controllable Emotion Transfer For End-to-End Speech Synthesis
Paper Authors
Paper Abstract
Emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the emotion transferred to the synthetic speech is neither accurate nor expressive enough, with confusion among emotion categories. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers, one after the reference encoder and one after the decoder output, to enhance the emotion-discriminative ability of the emotion embedding and the predicted mel-spectrum. Second, we adopt a style loss to measure the difference between the generated and reference mel-spectra. Since the emotion embedding can be viewed as a feature map of the mel-spectrum, the emotion strength in the synthetic speech can be controlled by adjusting the value of the emotion embedding. Experiments on emotion transfer and strength control show that the speech synthesized by the proposed method is more accurate and expressive, with less confusion among emotion categories, and that the control of emotion strength is more salient to listeners.
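To make two of the mechanisms above concrete, the following is a minimal sketch, assuming a Gram-matrix style loss computed on mel-spectrograms (in the spirit of image style transfer) and strength control by multiplying the emotion embedding by a scalar. The names gram_matrix, style_loss, scale_emotion_embedding, and the factor alpha are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def gram_matrix(mel: torch.Tensor) -> torch.Tensor:
    # mel: (batch, n_mels, n_frames). Treating mel bins as channels,
    # the Gram matrix captures channel correlations and discards timing,
    # so the generated and reference spectra may differ in length.
    b, c, t = mel.shape
    gram = torch.bmm(mel, mel.transpose(1, 2))  # (batch, n_mels, n_mels)
    return gram / (c * t)  # normalize by feature-map size

def style_loss(generated_mel: torch.Tensor, reference_mel: torch.Tensor) -> torch.Tensor:
    # Mean squared difference between the two Gram matrices.
    return F.mse_loss(gram_matrix(generated_mel), gram_matrix(reference_mel))

def scale_emotion_embedding(embedding: torch.Tensor, alpha: float) -> torch.Tensor:
    # Assumed strength-control mechanism: since the emotion embedding acts
    # like a feature map of the mel-spectrum, scaling it should strengthen
    # (alpha > 1) or weaken (alpha < 1) the transferred emotion.
    return alpha * embedding

if __name__ == "__main__":
    gen = torch.randn(1, 80, 120)  # generated mel: 80 bins, 120 frames
    ref = torch.randn(1, 80, 95)   # reference mel of a different length
    print(style_loss(gen, ref).item())
    emb = torch.randn(1, 256)      # emotion embedding from the reference encoder
    print(scale_emotion_embedding(emb, 1.5).shape)
```

Because the Gram matrix collapses the time axis, this style loss compares the overall spectral "texture" of the two utterances rather than their frame-by-frame content, which is what makes it usable when the generated and reference audio have different durations.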