Paper Title
From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint
Authors
Abstract
In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper presents a multispeaker speech synthesis system with a feedback constraint. We enhance the knowledge transfer from speaker verification to speech synthesis by involving the speaker verification network. The constraint takes the form of an added loss related to speaker identity, which is aimed at improving the speaker similarity between the synthesized speech and its natural reference audio. The model is trained and evaluated on publicly available datasets. Experimental results, including visualizations of the speaker embedding space, show significant improvement in speaker identity cloning at the spectrogram level. Synthesized samples are available online for listening. (https://caizexin.github.io/mlspk-syn-samples/index.html)
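The abstract describes the feedback constraint only as an added loss tied to speaker identity. As a rough illustration, and not the paper's actual formulation, one way such a loss could be combined with a spectrogram reconstruction loss is sketched below. The function names, the use of cosine distance, the mean-squared-error reconstruction term, and the `weight` parameter are all assumptions made for this example.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def feedback_constrained_loss(pred_mel, target_mel, synth_emb, ref_emb, weight=1.0):
    """Reconstruction loss plus a speaker-similarity penalty (hypothetical form).

    pred_mel, target_mel: spectrograms of identical shape (frames x bins).
    synth_emb: speaker embedding of the synthesized speech.
    ref_emb: speaker embedding of the natural reference audio.
    weight: assumed trade-off between the two terms.
    """
    pred_mel = np.asarray(pred_mel, dtype=float)
    target_mel = np.asarray(target_mel, dtype=float)
    # Mean squared error between predicted and target spectrograms.
    reconstruction = float(np.mean((pred_mel - target_mel) ** 2))
    # Cosine distance: near 0 when the embeddings point the same way.
    speaker_term = 1.0 - cosine_similarity(synth_emb, ref_emb)
    return reconstruction + weight * speaker_term


# Stand-in data; in a real system the embeddings would come from a
# speaker verification network applied to each spectrogram.
mel = np.zeros((4, 3))
same = feedback_constrained_loss(mel, mel, [1.0, 0.0], [1.0, 0.0])
different = feedback_constrained_loss(mel, mel, [1.0, 0.0], [0.0, 1.0])
print(same < different)
```

In this sketch the speaker term penalizes synthesized speech whose embedding drifts away from that of the reference audio, which is the general idea the abstract attributes to the feedback constraint.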