Paper title
New Insights on Target Speaker Extraction
Paper authors
Paper abstract
Speaker extraction (SE) aims to segregate the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information. Several forms of auxiliary information have been employed in single-channel SE, such as a speech snippet enrolled from the target speaker or visual information corresponding to the spoken utterance. The effectiveness of the auxiliary information in SE is typically evaluated by comparing the extraction performance of SE with uninformed speaker separation (SS) methods. Following this evaluation protocol, many SE studies have reported performance improvement compared to SS, attributing this to the auxiliary information. However, such studies have been conducted on a few datasets and have not considered recent deep neural network architectures for SS that have shown impressive separation performance. In this paper, we examine the role of the auxiliary information in SE for different input scenarios and over multiple datasets. Specifically, we compare the performance of two SE systems (audio-based and video-based) with SS using a common framework that utilizes the recently proposed dual-path recurrent neural network as the main learning machine. Experimental evaluation on various datasets demonstrates that the use of auxiliary information in the considered SE systems does not always lead to better extraction performance compared to the uninformed SS system. Furthermore, we offer insights into the behavior of the SE systems when provided with different and distorted auxiliary information given the same mixture input.