Paper Title
Unsupervised Sound Separation Using Mixture Invariant Training
Paper Authors
Paper Abstract
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the acoustic conditions and distribution of sources. The acoustic properties can be challenging to accurately simulate, and the distribution of sound types may be hard to replicate. In this paper, we propose a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures. In MixIT, training examples are constructed by mixing together existing mixtures, and the model separates them into a variable number of latent sources, such that the separated sources can be remixed to approximate the original mixtures. We show that MixIT can achieve competitive performance compared to supervised methods on speech separation. Using MixIT in a semi-supervised learning setting enables unsupervised domain adaptation and learning from large amounts of real-world data without ground-truth source waveforms. In particular, we significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data.
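The core idea described above can be sketched in code: given two reference mixtures, the model receives their sum and produces M estimated sources, and the loss is computed under the best binary assignment of those sources back to the two mixtures. The following is a minimal NumPy sketch, not the authors' implementation; the function names (`snr_loss`, `mixit_loss`) and the brute-force enumeration over assignments are illustrative assumptions.

```python
import itertools
import numpy as np

def snr_loss(estimate, target, eps=1e-8):
    """Negative SNR (in dB) between a remixed estimate and a reference mixture.
    Illustrative stand-in for the paper's training loss."""
    noise = estimate - target
    return -10.0 * np.log10(np.sum(target**2) / (np.sum(noise**2) + eps) + eps)

def mixit_loss(separated, mix1, mix2):
    """MixIT-style loss sketch.

    separated: (M, T) array of M estimated sources, produced by a
               separation model given the input mixture mix1 + mix2.
    mix1, mix2: (T,) reference mixtures.

    Enumerates all 2^M binary assignments of estimated sources to the
    two reference mixtures and returns the loss of the best one.
    """
    m = separated.shape[0]
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=m):
        a = np.array(assignment)
        est1 = separated[a == 0].sum(axis=0)  # sources remixed toward mix1
        est2 = separated[a == 1].sum(axis=0)  # sources remixed toward mix2
        best = min(best, snr_loss(est1, mix1) + snr_loss(est2, mix2))
    return best
```

Because the assignment minimization is inside the loss, the model is never told which latent sources belong to which original mixture; it only needs the remixed estimates to reconstruct both references well.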