VoiceMe: Personalized voice generation in TTS

Paper title

VoiceMe: Personalized voice generation in TTS

Authors

Pol van Rijn, Silvan Mertes, Dominik Schiller, Piotr Dura, Hubert Siuzdak, Peter M. C. Harrison, Elisabeth André, Nori Jacoby

Abstract

Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains a difficult task to efficiently create personalized voices from a high-dimensional speaker space. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that fit well to photos of faces, art portraits, and cartoons. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) speaker gender apparent from the face is well-recovered in the voice, and (3) people are consistently moving towards the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide number of applications including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairment.
