Paper Title
Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion
Paper Authors
Paper Abstract
Voice conversion aims to generate new speech with the source content and a target voice style. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require paired source and reference speech and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the sophistication of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of the mel-spectrogram, we propose a new voice conversion framework, i.e., the Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts each subband content of the source speech separately by explicitly utilizing the spatial characteristics between different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. In particular, the style encoder is designed to learn style codes for the different subbands of the target speaker, and the content encoder captures the content information of the source speech. Finally, the decoder generates the converted content for each subband. In addition, we propose a pitch-shift module to fine-tune the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on the VCTK Corpus and AISHELL3 datasets, both qualitatively and quantitatively, on both seen and unseen data. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.
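To make the subband idea described in the abstract concrete, the following is a minimal PyTorch sketch of per-subband style encoding and decoding around a shared content encoder. All module names, layer choices, and hyperparameters (SubbandStyleEncoder, ContentEncoder, SubbandDecoder, n_subbands=4, style_dim=64, etc.) are illustrative assumptions rather than the authors' actual SGAN-VC implementation; the pitch-shift module and the adversarial training objectives are omitted.

```python
# Hypothetical sketch of subband-wise voice conversion on a mel-spectrogram.
# Sizes and layer choices are assumptions; only the splitting/recombination
# of frequency subbands follows the idea stated in the abstract.
import torch
import torch.nn as nn


class SubbandStyleEncoder(nn.Module):
    """Encodes a reference mel-spectrogram into one style code per subband."""

    def __init__(self, n_mels=80, n_subbands=4, style_dim=64):
        super().__init__()
        assert n_mels % n_subbands == 0
        self.n_subbands = n_subbands
        band_mels = n_mels // n_subbands
        self.encoders = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(band_mels, 128, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # pool over time
                nn.Flatten(),
                nn.Linear(128, style_dim),
            )
            for _ in range(n_subbands)
        )

    def forward(self, mel):  # mel: (B, n_mels, T)
        bands = mel.chunk(self.n_subbands, dim=1)  # split along frequency
        return [enc(band) for enc, band in zip(self.encoders, bands)]


class ContentEncoder(nn.Module):
    """Extracts a content representation from the source mel-spectrogram."""

    def __init__(self, n_mels=80, content_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, content_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(content_dim, content_dim, kernel_size=5, padding=2),
        )

    def forward(self, mel):  # (B, n_mels, T) -> (B, content_dim, T)
        return self.net(mel)


class SubbandDecoder(nn.Module):
    """Generates each subband of the converted mel-spectrogram separately,
    conditioning shared content features on per-subband style codes."""

    def __init__(self, n_mels=80, n_subbands=4, content_dim=128, style_dim=64):
        super().__init__()
        band_mels = n_mels // n_subbands
        self.decoders = nn.ModuleList(
            nn.Conv1d(content_dim + style_dim, band_mels, kernel_size=5, padding=2)
            for _ in range(n_subbands)
        )

    def forward(self, content, style_codes):  # content: (B, content_dim, T)
        T = content.size(-1)
        bands = []
        for dec, s in zip(self.decoders, style_codes):
            s = s.unsqueeze(-1).expand(-1, -1, T)  # broadcast style over time
            bands.append(dec(torch.cat([content, s], dim=1)))
        return torch.cat(bands, dim=1)  # reassemble (B, n_mels, T)


if __name__ == "__main__":
    src = torch.randn(2, 80, 100)  # source mel-spectrogram
    ref = torch.randn(2, 80, 120)  # reference (target-speaker) mel-spectrogram
    style_codes = SubbandStyleEncoder()(ref)
    content = ContentEncoder()(src)
    converted = SubbandDecoder()(content, style_codes)
    print(converted.shape)  # torch.Size([2, 80, 100])
```

In this sketch, each frequency band gets its own style code and decoder branch, while the content path is shared; the converted subbands are concatenated back along the frequency axis to form the output mel-spectrogram.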