Paper Title
Late Audio-Visual Fusion for In-The-Wild Speaker Diarization
Paper Authors
Paper Abstract
Speaker diarization is well studied for constrained audio but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin, with the fused audio-visual system achieving a new SOTA on the AVA-AVD benchmark.
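The abstract describes combining the audio-only and visual-centric sub-systems by late fusion, i.e. merging their per-speaker speech-activity estimates rather than their features. As an illustrative sketch of that general idea only (the function name, the linear weighting, the fixed threshold, and the assumption of aligned speaker order and frame rate are all ours, not the authors' actual fusion mechanism):

```python
import numpy as np

def late_fuse(audio_probs, visual_probs, alpha=0.5):
    """Weighted late fusion of per-speaker speech-activity probabilities.

    audio_probs, visual_probs: arrays of shape (frames, speakers),
    assumed aligned on the same speaker order and frame rate.
    alpha: weight on the audio stream (a hypothetical hyperparameter).
    Returns binary diarization labels of the same shape.
    """
    fused = alpha * audio_probs + (1.0 - alpha) * visual_probs
    # Threshold the fused probabilities into speaking / not-speaking labels.
    return (fused > 0.5).astype(int)

# Toy example: 4 frames, 2 speakers.
audio = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.9]])
visual = np.array([[0.7, 0.0], [0.9, 0.1], [0.2, 0.8], [0.1, 0.6]])
labels = late_fuse(audio, visual)
```

In this toy run, both streams agree that speaker 1 talks in the first two frames and speaker 2 in the last two, so the fused labels reflect that; in practice the visual stream covers only on-screen speakers, which is why the paper treats it as a complement to the audio-only sub-system.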