Paper title
SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning
Authors
Abstract
In many situations, we would like to hear desired sound events (SEs) while being able to ignore interference. Target sound extraction (TSE) tackles this problem by estimating the audio signal of the sounds of target SE classes in a mixture of sounds while suppressing all other sounds. We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing the target SE classes. Two types of clues have been proposed, i.e., target SE class labels and enrollment audio samples (or audio queries), which are pre-recorded audio samples of sounds from the target SE classes. Systems based on SE class labels can directly optimize embedding vectors representing the SE classes, resulting in high extraction performance. However, extending these systems to extract new SE classes not encountered during training is not easy. Enrollment-based approaches extract SEs by finding sounds in the mixtures that share similar characteristics to the enrollment audio samples. These approaches do not explicitly rely on SE class definitions and can thus handle new SE classes. In this paper, we introduce a TSE framework, SoundBeam, that combines the advantages of both approaches. We also perform an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.
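The two clue types described in the abstract can be illustrated with a toy sketch. It is not the SoundBeam architecture: the abstract does not say how a clue is computed or fused with the mixture, so summing class embeddings, averaging enrollment frames, and element-wise multiplication are all assumptions made for illustration, and every function name here is hypothetical.

```python
import random

EMBED_DIM = 4  # toy embedding size (assumption)


def make_class_embeddings(class_names, dim=EMBED_DIM, seed=0):
    """Build a lookup table with one vector per known SE class.

    In a trained system these vectors would be learned; here they are random.
    """
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 1.0) for _ in range(dim)] for name in class_names}


def clue_from_labels(labels, table):
    """Class-label clue: sum the embeddings of the target classes (assumption)."""
    dim = len(next(iter(table.values())))
    clue = [0.0] * dim
    for label in labels:
        for i, value in enumerate(table[label]):
            clue[i] += value
    return clue


def clue_from_enrollment(enrollment_frames):
    """Enrollment clue: average frame-level features of an audio sample (assumption).

    No class label is needed, so this also works for classes unseen in training.
    """
    n_frames = len(enrollment_frames)
    dim = len(enrollment_frames[0])
    return [sum(frame[i] for frame in enrollment_frames) / n_frames for i in range(dim)]


def condition(mixture_frames, clue):
    """Fuse the clue with the mixture features by element-wise multiplication (assumption)."""
    return [[x * c for x, c in zip(frame, clue)] for frame in mixture_frames]


if __name__ == "__main__":
    table = make_class_embeddings(["dog_bark", "siren"])
    mixture = [[1.0] * EMBED_DIM for _ in range(3)]  # stand-in for encoded mixture frames

    # Either clue type produces a vector of the same shape, so the same
    # extraction network could be conditioned on whichever one is available.
    label_clue = clue_from_labels(["dog_bark"], table)
    enroll_clue = clue_from_enrollment([[0.5] * EMBED_DIM, [1.5] * EMBED_DIM])
    print(condition(mixture, label_clue)[0])
    print(condition(mixture, enroll_clue)[0])
```

Because both clue types end up as a vector of the same size, one extraction network can accept either, which is the sense in which a combined framework can keep label-based conditioning for known classes while still handling new classes through enrollment audio.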