Paper Title

nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech

Authors

Botao Zhao, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Abstract

Multi-speaker text-to-speech (TTS) with only a few adaptation samples is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS system, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance. Instead of using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of the latent variable Z is approximated by another variable conditioned on the reference mel-spectrogram and the phonemes. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural and similar speech with only one adaptation utterance.
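The core mechanism described in the abstract — a latent variable Z whose posterior is conditioned on a reference mel-spectrogram and phoneme information, approximated against a prior — can be illustrated with a minimal conditional-VAE sketch. This is not the paper's implementation: all dimensions, the linear "networks", and the phoneme-only prior are illustrative assumptions, shown only to make the reparameterization and KL terms concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not taken from the paper).
PHONEME_DIM, MEL_DIM, Z_DIM = 8, 10, 4

def gaussian_params(x, w_mu, w_logvar):
    """Map an input vector to the mean and log-variance of a diagonal Gaussian."""
    return x @ w_mu, x @ w_logvar

# Toy posterior network q(Z | mel, phoneme): conditioned on the
# reference mel-spectrogram frame and the phoneme representation.
w_mu_q = rng.normal(scale=0.1, size=(MEL_DIM + PHONEME_DIM, Z_DIM))
w_lv_q = rng.normal(scale=0.1, size=(MEL_DIM + PHONEME_DIM, Z_DIM))

# Toy prior network p(Z | phoneme): a conditional prior that does not
# need the reference mel-spectrogram (an assumption for this sketch).
w_mu_p = rng.normal(scale=0.1, size=(PHONEME_DIM, Z_DIM))
w_lv_p = rng.normal(scale=0.1, size=(PHONEME_DIM, Z_DIM))

phoneme = rng.normal(size=PHONEME_DIM)
ref_mel = rng.normal(size=MEL_DIM)

# Posterior parameters from the concatenated conditioning inputs.
mu_q, logvar_q = gaussian_params(
    np.concatenate([ref_mel, phoneme]), w_mu_q, w_lv_q
)
mu_p, logvar_p = gaussian_params(phoneme, w_mu_p, w_lv_p)

# Reparameterization trick: sample Z differentiably from the posterior.
eps = rng.normal(size=Z_DIM)
z = mu_q + np.exp(0.5 * logvar_q) * eps

# KL(q || p) between two diagonal Gaussians; during CVAE training this
# term pulls the posterior toward the conditional prior, so that at
# inference Z can be sampled from the prior alone.
kl = 0.5 * np.sum(
    logvar_p - logvar_q
    + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
    - 1.0
)

print(z.shape, float(kl))
```

A decoder would then reconstruct the mel-spectrogram from Z (plus the phoneme condition); the KL term above is what lets the model drop the reference encoder at synthesis time.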
