Paper Title
Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion
Paper Authors
Paper Abstract
Though significant progress has been made in the voice conversion (VC) of typical speech, VC for atypical speech, e.g., dysarthric and second-language (L2) speech, remains a challenge, since it involves correcting for atypical prosody while maintaining speaker identity. To address this issue, we propose a VC system with explicit prosodic modelling and deep speaker embedding (DSE) learning. First, a speech encoder strives to extract robust phoneme embeddings from atypical speech. Second, a prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values. Third, a conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech, conditioned on the target DSE that is learned via a speaker encoder or speaker adaptation. Extensive experiments demonstrate that speaker adaptation achieves higher speaker similarity, while the speaker-encoder-based conversion model greatly reduces dysarthric and non-native pronunciation patterns with improved speech intelligibility. A comparison of speech recognition results between the original dysarthric speech and the converted speech shows that an absolute reduction of 47.6% in character error rate (CER) and 29.3% in word error rate (WER) can be achieved.
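To make the three-stage design of the abstract concrete, the following is a minimal PyTorch-style sketch of the pipeline: a speech encoder produces phoneme embeddings, a prosody corrector predicts typical durations and pitch, and a conversion model decodes converted acoustic features conditioned on a target DSE. All module names, layer choices, and dimensions are hypothetical illustrations for readability, not the authors' implementation.

```python
# Hypothetical sketch of the described pipeline; not the paper's actual architecture.
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    """Extracts frame-level phoneme embeddings from atypical speech features."""
    def __init__(self, feat_dim=80, embed_dim=256):
        super().__init__()
        self.net = nn.LSTM(feat_dim, embed_dim, num_layers=2, batch_first=True)

    def forward(self, mel):                      # mel: (B, T, feat_dim)
        phoneme_embeddings, _ = self.net(mel)    # (B, T, embed_dim)
        return phoneme_embeddings


class ProsodyCorrector(nn.Module):
    """Predicts typical durations and pitch values from phoneme embeddings."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.duration_head = nn.Linear(embed_dim, 1)  # e.g., log-duration
        self.pitch_head = nn.Linear(embed_dim, 1)     # e.g., log-F0

    def forward(self, phoneme_embeddings):
        duration = self.duration_head(phoneme_embeddings).squeeze(-1)
        pitch = self.pitch_head(phoneme_embeddings).squeeze(-1)
        return duration, pitch


class ConversionModel(nn.Module):
    """Generates acoustic features from phoneme embeddings, corrected prosody,
    and a target deep speaker embedding (DSE)."""
    def __init__(self, embed_dim=256, dse_dim=128, out_dim=80):
        super().__init__()
        self.decoder = nn.LSTM(embed_dim + 2 + dse_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, out_dim)

    def forward(self, phoneme_embeddings, duration, pitch, dse):
        # Broadcast the utterance-level DSE over time and concatenate all inputs.
        dse_seq = dse.unsqueeze(1).expand(-1, phoneme_embeddings.size(1), -1)
        prosody = torch.stack([duration, pitch], dim=-1)
        x = torch.cat([phoneme_embeddings, prosody, dse_seq], dim=-1)
        hidden, _ = self.decoder(x)
        return self.proj(hidden)                 # converted acoustic frames


# Toy forward pass with random inputs.
mel = torch.randn(1, 120, 80)                    # atypical-speech mel features
dse = torch.randn(1, 128)                        # target speaker embedding
emb = SpeechEncoder()(mel)
dur, f0 = ProsodyCorrector()(emb)
converted = ConversionModel()(emb, dur, f0, dse)
print(converted.shape)                           # torch.Size([1, 120, 80])
```

In this sketch the DSE is simply concatenated to every frame; the abstract's alternative of speaker adaptation would instead fine-tune the conversion model's parameters on target-speaker data rather than conditioning on an encoder output.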