Real-Time Target Sound Extraction

Paper Title

Real-Time Target Sound Extraction

Authors

Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota

Abstract

We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
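The abstract credits dilated causal convolutions with covering a large receptive field cheaply. The following is a minimal pure-Python sketch of that general idea, not the Waveformer implementation: the kernel size, the dilation schedule, the weights, and all function names here are hypothetical choices made for illustration only. It shows that stacking layers with doubling dilation grows the receptive field quickly while each output sample still depends only on current and past input.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: out[t] uses x[t], x[t-d], x[t-2d], ... only."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - i * dilation  # look back only, never ahead
            if idx >= 0:            # samples before t=0 act as zero padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def conv_stack(x, weights, dilations):
    """Apply one causal layer per dilation value, in order."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


# Hypothetical configuration (NOT taken from the paper):
KERNEL = [0.5, 0.3, 0.2]   # kernel size 3, same weights in every layer
DILATIONS = [1, 2, 4, 8]   # dilation doubles per layer

signal = [0.0] * 40
signal[5] = 1.0            # unit impulse at t = 5
response = conv_stack(signal, KERNEL, DILATIONS)

print("receptive field:", receptive_field(len(KERNEL), DILATIONS))
print("first nonzero output index:", next(i for i, v in enumerate(response) if v != 0.0))
```

With these assumed settings, four layers of kernel size 3 span 31 input samples, whereas four undilated layers of the same size would span only 9. Because nothing before the impulse is affected, the stack can run sample by sample on a stream, which is the property a real-time system needs.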
