Paper Title
HiFi-VC: High Quality ASR-Based Voice Conversion
Paper Authors
Paper Abstract
The goal of voice conversion (VC) is to convert an input voice to match a target speaker's voice while keeping the text and prosody intact. VC is commonly used in entertainment and speaking-aid systems, and is also applied to speech data generation and augmentation. The development of any-to-any VC systems, which can generate voices unseen during model training, is of particular interest to both researchers and industry. Despite recent progress, any-to-any conversion quality is still inferior to natural speech. In this work, we propose a new any-to-any voice conversion pipeline. Our approach uses automatic speech recognition (ASR) features, pitch tracking, and a state-of-the-art waveform prediction model. According to multiple subjective and objective evaluations, our method outperforms modern baselines in terms of voice quality, similarity, and consistency.