Paper Title

Using previous acoustic context to improve Text-to-Speech synthesis

Paper Authors

Pilar Oplustil-Gallegos, Simon King

Paper Abstract

Many speech synthesis datasets, especially those derived from audiobooks, naturally comprise sequences of utterances. Nevertheless, such data are commonly treated as individual, unordered utterances both when training a model and at inference time. This discards important prosodic phenomena above the utterance level. In this paper, we leverage the sequential nature of the data using an acoustic context encoder that produces an embedding of the previous utterance audio. This is input to the decoder in a Tacotron 2 model. The embedding is also used for a secondary task, providing additional supervision. We compare two secondary tasks: predicting the ordering of utterance pairs, and predicting the embedding of the current utterance audio. Results show that the relation between consecutive utterances is informative: our proposed model significantly improves naturalness over a Tacotron 2 baseline.
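
The abstract describes an acoustic context encoder that maps the previous utterance's audio to a fixed-size embedding, which is then fed to the Tacotron 2 decoder. The exact architecture is not specified here, so the sketch below is only a minimal PyTorch illustration of the general idea: `AcousticContextEncoder`, all layer sizes, and the concatenation-based conditioning in `condition_encoder_outputs` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AcousticContextEncoder(nn.Module):
    """Encodes the mel-spectrogram of the previous utterance into a
    fixed-size context embedding (hypothetical layer sizes)."""

    def __init__(self, n_mels=80, hidden_dim=128, embedding_dim=64):
        super().__init__()
        # Convolutions over mel-spectrogram frames, then a GRU whose final
        # hidden state serves as the utterance-level context embedding.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(hidden_dim, embedding_dim, batch_first=True)

    def forward(self, prev_mel):
        # prev_mel: (batch, n_mels, frames) mel-spectrogram of the previous utterance
        features = self.conv(prev_mel)            # (batch, hidden_dim, frames)
        features = features.transpose(1, 2)       # (batch, frames, hidden_dim)
        _, final_state = self.gru(features)       # (1, batch, embedding_dim)
        return final_state.squeeze(0)             # (batch, embedding_dim)


def condition_encoder_outputs(encoder_outputs, context_embedding):
    """Broadcast the context embedding over time and concatenate it with the
    text-encoder outputs that the decoder attends to (one simple way to
    inject the embedding; the paper may condition the decoder differently)."""
    # encoder_outputs: (batch, text_steps, enc_dim); context_embedding: (batch, emb_dim)
    expanded = context_embedding.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
    return torch.cat([encoder_outputs, expanded], dim=-1)


if __name__ == "__main__":
    batch, n_mels, frames = 2, 80, 400
    encoder = AcousticContextEncoder()
    prev_mel = torch.randn(batch, n_mels, frames)  # previous-utterance audio features
    context = encoder(prev_mel)                    # (2, 64)

    text_memory = torch.randn(batch, 60, 512)      # placeholder Tacotron 2 encoder outputs
    conditioned = condition_encoder_outputs(text_memory, context)
    print(context.shape, conditioned.shape)        # (2, 64) and (2, 60, 576)
```

The secondary tasks mentioned in the abstract (predicting the ordering of an utterance pair, or predicting the embedding of the current utterance's audio) would add a small classification or regression head on top of this context embedding; they are omitted from the sketch for brevity.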
