Paper Title

Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

Paper Authors

Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

Paper Abstract

An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology to languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system based on an alignment module that outputs pseudo-text and another synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each. A careful study on the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo, and our code can be found at https://github.com/lwang114/UnsupTTS.
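
To make the two-stage design in the abstract concrete, here is a minimal sketch of the data flow: an alignment (unsupervised-ASR) module pseudo-labels the untranscribed speech, and a standard TTS model is then trained on those (pseudo-text, waveform) pairs while real text is used at inference. All class and function names below are hypothetical placeholders, not the authors' implementation; see the linked GitHub repository for the actual code.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# NOTE: every class and function name below is a hypothetical placeholder,
# not the authors' code; method bodies are intentionally left as stubs.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    """One untranscribed speech waveform (e.g. a list of 16 kHz samples)."""
    waveform: List[float]


class UnsupervisedAligner:
    """Stage 1 (hypothetical): an unsupervised-ASR / alignment module that
    learns from unpaired speech and text and outputs pseudo-text."""

    def fit(self, speech: List[Utterance], text_corpus: List[str]) -> None:
        ...  # trained without any transcribed speech (stub)

    def transcribe(self, utt: Utterance) -> str:
        ...  # returns a pseudo-text transcription for one utterance (stub)


class Synthesizer:
    """Stage 2 (hypothetical): a standard TTS model trained on
    (pseudo-text, waveform) pairs produced by stage 1."""

    def fit(self, pairs: List[Tuple[str, Utterance]]) -> None:
        ...  # supervised-style TTS training on pseudo-labelled data (stub)

    def synthesize(self, text: str) -> Utterance:
        ...  # at inference time the input is real text, not pseudo-text (stub)


def build_unsupervised_tts(speech: List[Utterance],
                           text_corpus: List[str]) -> Synthesizer:
    """Wire the two stages together: pseudo-label the speech, then train TTS."""
    aligner = UnsupervisedAligner()
    aligner.fit(speech, text_corpus)

    # Pseudo-text is used only for training; real text is used at inference.
    pairs = [(aligner.transcribe(utt), utt) for utt in speech]

    tts = Synthesizer()
    tts.fit(pairs)
    return tts
```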
