Paper Title
Text-Driven Separation of Arbitrary Sounds
Paper Authors
Paper Abstract
We propose a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source. This is achieved by combining two distinct models. The first model, SoundWords, is trained to jointly embed both an audio clip and its textual description to the same embedding in a shared representation. The second model, SoundFilter, takes a mixed source audio clip as an input and separates it based on a conditioning vector from the shared text-audio representation defined by SoundWords, making the model agnostic to the conditioning modality. Evaluating on multiple datasets, we show that our approach can achieve an SI-SDR of 9.1 dB for mixtures of two arbitrary sounds when conditioned on text and 10.1 dB when conditioned on audio. We also show that SoundWords is effective at learning co-embeddings and that our multi-modal training approach improves the performance of SoundFilter.
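To make the two-model setup concrete, the following is a minimal PyTorch sketch of the pipeline the abstract describes: a co-embedding model that maps audio and text into one shared space, and a separation model conditioned on a vector from that space. The names SoundWords and SoundFilter come from the paper, but everything else here (the toy encoder architectures, the embedding size EMB_DIM, and the FiLM-style conditioning) is an illustrative assumption, not the paper's actual implementation.

```python
# Illustrative sketch only: architectures, dimensions, and conditioning
# mechanism are assumptions; the paper's real models differ.
import torch
import torch.nn as nn

EMB_DIM = 512  # assumed size of the shared text-audio embedding space

class SoundWords(nn.Module):
    """Co-embeds audio clips and text descriptions into one shared space."""
    def __init__(self, vocab_size=10000):
        super().__init__()
        self.audio_enc = nn.Sequential(          # stand-in audio encoder
            nn.Conv1d(1, 64, kernel_size=16, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, EMB_DIM))
        self.text_emb = nn.Embedding(vocab_size, EMB_DIM)  # stand-in text encoder

    def embed_audio(self, wav):      # wav: (batch, 1, samples)
        return nn.functional.normalize(self.audio_enc(wav), dim=-1)

    def embed_text(self, tokens):    # tokens: (batch, seq_len) of word ids
        return nn.functional.normalize(self.text_emb(tokens).mean(dim=1), dim=-1)

class SoundFilter(nn.Module):
    """Separates the target source from a mixture, given a conditioning
    vector from the shared space (either modality produces one)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv1d(1, 64, kernel_size=16, stride=8, padding=4)
        self.film = nn.Linear(EMB_DIM, 2 * 64)  # FiLM scale/shift from condition
        self.dec = nn.ConvTranspose1d(64, 1, kernel_size=16, stride=8, padding=4)

    def forward(self, mixture, cond):  # mixture: (batch, 1, samples)
        h = self.enc(mixture)
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)  # modulate by condition
        return self.dec(torch.relu(h))

# Usage: condition on text at inference even if an audio example was used
# elsewhere, since both modalities map to the same embedding space.
words, filt = SoundWords(), SoundFilter()
mix = torch.randn(2, 1, 16000)                           # toy 1 s mixtures
cond = words.embed_text(torch.randint(0, 10000, (2, 8)))  # toy text tokens
separated = filt(mix, cond)                               # (2, 1, 16000)
```

The design point this sketch captures is the one the abstract emphasizes: because SoundFilter only ever sees an embedding from the shared space, it is agnostic to whether that embedding came from a text description or a short audio sample of the target source.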