F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

Paper title

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

Authors

Kaizhi Qian, Zeyu Jin, Mark Hasegawa-Johnson, Gautham J. Mysore

Abstract

Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Many style-transfer-inspired methods, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), have been proposed. Recently, AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity and speech content using information-constraining bottlenecks, and it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice. However, we found that while speaker identity is disentangled from speech content, a significant amount of prosodic information, such as source F0, leaks through the bottleneck, causing the target F0 to fluctuate unnaturally. Furthermore, AutoVC has no control over the converted F0 and is thus unsuitable for many applications. In this paper, we modify and improve autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time. As a result, we can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity. We support our improvement through quantitative and qualitative analysis.
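The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the layer sizes, the single linear layers with random untrained weights, the function names (`encode`, `decode`, `convert`), and the normalized F0 range are all assumptions made for illustration. It only shows the idea that a narrow bottleneck code is decoded together with a swapped-in target speaker embedding and an explicit F0 contour.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_MELS, SPK_DIM, CODE_DIM = 80, 16, 8


def linear(in_dim, out_dim):
    """Random, untrained weights; a stand-in for a learned layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.1


W_enc = linear(N_MELS + SPK_DIM, CODE_DIM)      # narrow bottleneck
W_dec = linear(CODE_DIM + SPK_DIM + 1, N_MELS)  # +1 for the F0 input


def encode(mel, spk_emb):
    """Map (frames, N_MELS) mel frames to a (frames, CODE_DIM) content code."""
    spk = np.broadcast_to(spk_emb, (mel.shape[0], SPK_DIM))
    return np.tanh(np.concatenate([mel, spk], axis=1) @ W_enc)


def decode(code, spk_emb, f0):
    """Rebuild mel frames from content code, speaker embedding, and per-frame F0."""
    spk = np.broadcast_to(spk_emb, (code.shape[0], SPK_DIM))
    cond = np.concatenate([code, spk, f0[:, None]], axis=1)
    return cond @ W_dec


def convert(mel, src_emb, tgt_emb, f0):
    """Encode with the source speaker, decode with the target speaker and a chosen F0."""
    return decode(encode(mel, src_emb), tgt_emb, f0)


frames = 50
mel = rng.standard_normal((frames, N_MELS))
src_emb = rng.standard_normal(SPK_DIM)
tgt_emb = rng.standard_normal(SPK_DIM)
f0 = rng.uniform(0.0, 1.0, frames)  # assumed: F0 contour normalized to [0, 1]

converted = convert(mel, src_emb, tgt_emb, f0)
```

Because F0 enters the decoder as a separate input in this sketch, passing a different contour changes the output without touching the content code, which is the kind of control the abstract refers to.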
