Paper title

Self-supervised learning for audio-visual speaker diarization

Authors

Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang

Abstract

Speaker diarization, the task of finding the speech segments of specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method to address speaker diarization without massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We test them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +8% in F1-score as well as a reduction in diarization error rate. Finally, we introduce a new large-scale audio-video corpus designed to fill the vacancy of audio-video datasets in Chinese.
