NAUTILUS: a Versatile Voice Cloning System

Paper title

NAUTILUS: a Versatile Voice Cloning System

Authors

Hieu-Thi Luong, Junichi Yamagishi

Abstract

We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data circumstance of the target speaker, the cloning strategy can be adjusted to take advantage of additional data and modify the behaviors of text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolution layers to model the encoders, decoders and WaveNet vocoder. Evaluations show that it achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.
