Paper Title

One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech

Paper Authors

Tomáš Nekvinda, Ondřej Dušek

Paper Abstract

We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches. Our model is based on Tacotron 2 with a fully convolutional input text encoder whose weights are predicted by a separate parameter generator network. To boost voice cloning, the model uses an adversarial speaker classifier with a gradient reversal layer that removes speaker-specific information from the encoder. We arranged two experiments to compare our model with baselines using various levels of cross-lingual parameter sharing, in order to evaluate: (1) stability and performance when training on low amounts of data, (2) pronunciation accuracy and voice quality of code-switching synthesis. For training, we used the CSS10 dataset and our new small dataset based on Common Voice recordings in five languages. Our model is shown to effectively share information across languages and according to a subjective evaluation test, it produces more natural and accurate code-switching speech than the baselines.
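The abstract names two mechanisms: a parameter generator network that predicts the weights of the convolutional text encoder from a language embedding (contextual parameter generation), and an adversarial speaker classifier behind a gradient reversal layer that strips speaker-specific information from the encoder outputs. The following is a minimal PyTorch sketch of both ideas, not the authors' actual implementation; the module and argument names (ParameterGenerator, SpeakerClassifier, lang_dim, enc_dim, lambd) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


class ParameterGenerator(torch.nn.Module):
    """Contextual parameter generation: predict the weights of one convolutional
    encoder layer from a per-language embedding, so languages share a generator
    instead of maintaining fully separate encoders."""

    def __init__(self, lang_dim, in_ch, out_ch, kernel):
        super().__init__()
        self.in_ch, self.out_ch, self.kernel = in_ch, out_ch, kernel
        n_params = out_ch * in_ch * kernel + out_ch  # conv weight + bias
        self.gen = torch.nn.Linear(lang_dim, n_params)

    def forward(self, lang_emb, x):
        # lang_emb: (lang_dim,)   x: (batch, in_ch, time)
        params = self.gen(lang_emb)
        w_size = self.out_ch * self.in_ch * self.kernel
        weight = params[:w_size].view(self.out_ch, self.in_ch, self.kernel)
        bias = params[w_size:]
        return F.conv1d(x, weight, bias, padding=self.kernel // 2)


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class SpeakerClassifier(torch.nn.Module):
    """Adversarial speaker classifier: the reversed gradient pushes the encoder
    to remove speaker-specific information while the classifier itself still
    learns to predict the speaker."""

    def __init__(self, enc_dim, n_speakers, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = torch.nn.Sequential(
            torch.nn.Linear(enc_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, n_speakers),
        )

    def forward(self, encoder_outputs):
        # encoder_outputs: (batch, time, enc_dim); pool over time, reverse gradients.
        pooled = GradReverse.apply(encoder_outputs.mean(dim=1), self.lambd)
        return self.net(pooled)
```

In this sketch the generated convolution replaces a fixed encoder layer, and the classifier's cross-entropy loss is simply added to the synthesis loss; the gradient reversal makes that added term adversarial with respect to the encoder.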
