Paper Title
Improve few-shot voice cloning using multi-modal learning
Paper Authors
Abstract
Recently, few-shot voice cloning has achieved significant improvements. However, most models for few-shot voice cloning are single-modal, and multi-modal few-shot voice cloning has been understudied. In this paper, we propose to use multi-modal learning to improve few-shot voice cloning performance. Inspired by recent works on unsupervised speech representation, the proposed multi-modal system is built by extending Tacotron2 with an unsupervised speech representation module. We evaluate our proposed system in two few-shot voice cloning scenarios, namely few-shot text-to-speech (TTS) and voice conversion (VC). Experimental results demonstrate that the proposed multi-modal learning can significantly improve few-shot voice cloning performance over the counterpart single-modal systems.