Paper title
VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture
Authors
Abstract
Voice conversion (VC) is the task of transforming the timbre, accent, and tones of a source speaker's audio into those of another speaker while preserving the linguistic content. It remains a challenging task, especially in the one-shot setting. Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity, so these methods can generalize to unseen speakers. The disentanglement is achieved by vector quantization (VQ), adversarial training, or instance normalization (IN). However, imperfect disentanglement may harm the quality of the output speech. In this work, to further improve audio quality, we use the U-Net architecture within an auto-encoder-based VC system. We find that a strong information bottleneck is necessary to leverage the U-Net architecture. The VQ-based method, which quantizes the latent vectors, serves this purpose. Objective and subjective evaluations show that the proposed method performs well in both audio naturalness and speaker similarity.
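To make the idea of "quantizing the latent vectors" concrete, the following is a minimal sketch of generic vector quantization: each latent vector is replaced by its nearest entry in a codebook. It is an illustration only, not the VQVC+ implementation. The function names (`nearest_code`, `quantize`), the squared Euclidean distance, and the toy codebook and latent values are all assumptions made for this example; the abstract does not specify the codebook size, the distance measure, or how the codebook is learned.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`.

    Closeness is measured by squared Euclidean distance (an assumption
    for this sketch; the abstract does not name a distance).
    """
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(latents, codebook):
    """Replace every latent vector with its nearest codebook entry."""
    indices = [nearest_code(vector, codebook) for vector in latents]
    quantized = [codebook[i] for i in indices]
    return quantized, indices


# Hypothetical toy data: three 2-D codes and three latent frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

quantized, indices = quantize(latents, codebook)
print(indices)    # index of the chosen code for each frame
print(quantized)  # the discrete representation passed on to the decoder
```

Because every frame is forced onto one of a small number of codes, the quantized sequence can carry only a limited amount of information. This is the sense in which quantization can act as an information bottleneck.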