Paper Title
Late Audio-Visual Fusion for In-The-Wild Speaker Diarization
Paper Authors
Paper Abstract
Speaker diarization is well studied for constrained audio but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin, with the fused audio-visual system achieving a new SOTA on the AVA-AVD benchmark.
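The abstract describes combining the audio-only and visual-centric sub-systems by late fusion, i.e. merging their per-speaker speech-activity estimates rather than their features. As an illustrative sketch of that general idea only (the function name, the linear weighting, the fixed threshold, and the assumption of aligned speaker order and frame rate are all ours, not the authors' actual fusion mechanism):

```python
import numpy as np

def late_fuse(audio_probs, visual_probs, alpha=0.5):
    """Weighted late fusion of per-speaker speech-activity probabilities.

    audio_probs, visual_probs: arrays of shape (frames, speakers),
    assumed aligned on the same speaker order and frame rate.
    alpha: weight on the audio stream (a hypothetical hyperparameter).
    Returns binary diarization labels of the same shape.
    """
    fused = alpha * audio_probs + (1.0 - alpha) * visual_probs
    # Threshold the fused probabilities into speaking / not-speaking labels.
    return (fused > 0.5).astype(int)

# Toy example: 4 frames, 2 speakers.
audio = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.9]])
visual = np.array([[0.7, 0.0], [0.9, 0.1], [0.2, 0.8], [0.1, 0.6]])
labels = late_fuse(audio, visual)
```

In this toy run, both streams agree that speaker 1 talks in the first two frames and speaker 2 in the last two, so the fused labels reflect that; in practice the visual stream covers only on-screen speakers, which is why the paper treats it as a complement to the audio-only sub-system.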