论文标题

理智-TT:稳定且自然的端到端多语言文本到语音

SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

论文作者

Cho, Hyunjae, Jung, Wonbin, Lee, Junhyeok, Woo, Sang Hoon

论文摘要

在本文中,我们提出了Sane-TTS,这是一种稳定而自然的端到端多语言TTS模型。由于很难为给定的演讲者获得多语言语料库,因此不可避免地会使用单语言语料库进行多语言TTS模型。我们介绍了扬声器正规化损失,该损失可改善跨语性合成期间的语音自然性以及域对抗训练,该训练适用于其他多语言TTS模型。此外,通过添加扬声器正规化损失,以持续时间为零矢量嵌入的扬声器可以稳定跨语性推断。通过此替代品,我们的模型将以中等节奏的方式产生语音,而不论跨语性合成中的来源说话者。在MOS评估中,Sane-TTS在跨语义和内部合成中的自然性得分高于3.80,地面真相评分为3.99。此外,即使在跨语性的推论中,理智TTS也保持着接近地面真理的说话者相似性。音频样本可在我们的网页上找到。

In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. By the difficulty of obtaining multilingual corpus for given speaker, training multilingual TTS model with monolingual corpora is unavoidable. We introduce speaker regularization loss that improves speech naturalness during cross-lingual synthesis as well as domain adversarial training, which is applied in other multilingual TTS models. Furthermore, by adding speaker regularization loss, replacing speaker embedding with zero vector in duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speeches with moderate rhythm regardless of source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves naturalness score above 3.80 both in cross-lingual and intralingual synthesis, where the ground truth score is 3.99. Also, SANE-TTS maintains speaker similarity close to that of ground truth even in cross-lingual inference. Audio samples are available on our web page.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源