Paper Title

Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity

Paper Authors

Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, Geehyuk Lee

Paper Abstract

For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human--agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match with speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models. We further confirm that our model is able to work with synthesized audio in a scenario where contexts are constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space that is learned from videos of various speakers. All the code and data are available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.
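
As a rough illustration of the trimodal architecture summarized above, the sketch below fuses a text encoder, an audio encoder, and a learned speaker-style embedding into a single context vector that drives a recurrent pose decoder. This is a minimal sketch under assumed choices: the module names, feature types (MFCC-like audio frames), dimensions, and pose representation are illustrative rather than the authors' implementation, and the adversarial discriminator and proposed evaluation metric are omitted; the repository linked above contains the official code.

```python
# Minimal sketch of a trimodal gesture generator (illustrative; not the official code).
import torch
import torch.nn as nn

class TrimodalGestureGenerator(nn.Module):
    def __init__(self, vocab_size=20000, n_speakers=1000,
                 text_dim=128, audio_dim=128, style_dim=16, pose_dim=30):
        super().__init__()
        # Text context: word embeddings summarized by a GRU.
        self.word_emb = nn.Embedding(vocab_size, text_dim)
        self.text_enc = nn.GRU(text_dim, text_dim, batch_first=True)
        # Audio context: a small 1-D CNN over per-frame audio features (13 MFCCs assumed).
        self.audio_enc = nn.Sequential(
            nn.Conv1d(13, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, audio_dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Speaker identity: a learned style-embedding space; choosing or interpolating
        # different speaker ids here yields different gesture styles for the same speech.
        self.style_emb = nn.Embedding(n_speakers, style_dim)
        # Decoder maps the fused trimodal context to a pose sequence, frame by frame.
        self.decoder = nn.GRU(text_dim + audio_dim + style_dim, 256, batch_first=True)
        self.pose_out = nn.Linear(256, pose_dim)

    def forward(self, words, audio_feat, speaker_id, n_frames):
        # words: (B, T_w) token ids; audio_feat: (B, 13, T_a); speaker_id: (B,)
        _, text_h = self.text_enc(self.word_emb(words))       # (1, B, text_dim)
        text_ctx = text_h[-1]                                  # (B, text_dim)
        audio_ctx = self.audio_enc(audio_feat).mean(dim=2)     # (B, audio_dim)
        style = self.style_emb(speaker_id)                     # (B, style_dim)
        ctx = torch.cat([text_ctx, audio_ctx, style], dim=1)   # fused trimodal context
        dec_in = ctx.unsqueeze(1).repeat(1, n_frames, 1)       # repeat context per frame
        out, _ = self.decoder(dec_in)
        return self.pose_out(out)                              # (B, n_frames, pose_dim)

# Example: 2 utterances, 20 words, 100 audio frames -> 34 pose frames.
model = TrimodalGestureGenerator()
poses = model(torch.randint(0, 20000, (2, 20)),
              torch.randn(2, 13, 100),
              torch.tensor([3, 7]),
              n_frames=34)
print(poses.shape)  # torch.Size([2, 34, 30])
```

In the full model, a discriminator would additionally score generated pose sequences so that adversarial training pushes them toward human-like motion.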
