Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis

Paper title

Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis

Authors

Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

Abstract

Zero-shot speaker adaptation aims to clone the voice of an unseen speaker without any adaptation time or parameters. Previous research usually uses a speaker encoder to extract a single global, fixed speaker embedding from reference speech, and several studies have tried variable-length speaker embeddings. However, these approaches neglect to transfer the personal pronunciation characteristics related to phoneme content, which leads to poor speaker similarity in detailed speaking styles and pronunciation habits. To improve the ability of the speaker encoder to model personal pronunciation characteristics, we propose a content-dependent fine-grained speaker embedding for zero-shot speaker adaptation. Local content embeddings and the corresponding speaker embeddings are extracted from the reference speech. Instead of modeling temporal relations, a reference attention module models the content relevance between the reference speech and the input text and generates a fine-grained speaker embedding for each phoneme encoder output. Experimental results show that the proposed method improves the speaker similarity of synthesized speech, especially for unseen speakers.
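
The reference attention idea in the abstract can be illustrated with a small, dependency-free sketch. It assumes a scaled dot-product attention in which the phoneme encoder outputs act as queries, the local content embeddings of the reference speech act as keys, and the local speaker embeddings act as values. The abstract does not specify the attention form, the dimensions, or these role assignments, so the function name `reference_attention`, its arguments, and the toy vectors are hypothetical and are not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(phoneme_outputs, content_embeddings, speaker_embeddings):
    """Return one speaker embedding per phoneme encoder output.

    Assumed roles (an illustration, not confirmed by the abstract):
      queries = phoneme encoder outputs of the input text
      keys    = local content embeddings of the reference speech
      values  = local speaker embeddings of the reference speech
    """
    if len(content_embeddings) != len(speaker_embeddings):
        raise ValueError("need one speaker embedding per content embedding")
    scale = math.sqrt(len(content_embeddings[0]))
    value_dim = len(speaker_embeddings[0])
    fine_grained = []
    for query in phoneme_outputs:
        # Content relevance between this phoneme and every reference segment.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in content_embeddings
        ]
        weights = softmax(scores)
        # Relevance-weighted average of the local speaker embeddings.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, speaker_embeddings))
            for d in range(value_dim)
        ]
        fine_grained.append(mixed)
    return fine_grained


# Toy example: three phoneme outputs, two reference segments.
phonemes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
content = [[4.0, 0.0], [0.0, 4.0]]
speaker = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
embeddings = reference_attention(phonemes, content, speaker)
```

In this toy setup each phoneme receives its own speaker embedding, weighted toward the reference segment whose content it matches most closely, which is the sense in which the embedding is content-dependent rather than a single global vector.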

Paper title