Self-supervised learning for audio-visual speaker diarization

Paper title

Self-supervised learning for audio-visual speaker diarization

Authors

Ding, Yifan; Xu, Yong; Zhang, Shi-Xiong; Cong, Yahuan; Wang, Liqiang

Abstract

Speaker diarization, the task of finding the speech segments of specific speakers, is widely used in human-centered applications such as video conferences and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We test them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +8% in F1-score as well as a reduction in diarization error rate. Finally, we introduce a new large-scale audio-video corpus designed to fill the gap in Chinese audio-video datasets.
