An empirical study of Conv-TasNet

Paper title

An empirical study of Conv-TasNet

Authors

Berkan Kadioglu, Michael Horgan, Xiaoyu Liu, Jordi Pons, Dan Darcy, Vivek Kumar

Abstract

Conv-TasNet is a recently proposed waveform-based deep neural network that achieves state-of-the-art performance in speech source separation. Its architecture consists of a learnable encoder/decoder and a separator that operates on top of this learned space. Various improvements have been proposed to Conv-TasNet. However, they mostly focus on the separator, leaving its encoder/decoder as a (shallow) linear operator. In this paper, we conduct an empirical study of Conv-TasNet and propose an enhancement to the encoder/decoder that is based on a (deep) non-linear variant of it. In addition, we experiment with the larger and more diverse LibriTTS dataset and investigate the generalization capabilities of the studied models when trained on a much larger dataset. We propose cross-dataset evaluation that includes assessing separations from the WSJ0-2mix, LibriTTS and VCTK databases. Our results show that enhancements to the encoder/decoder can improve average SI-SNR performance by more than 1 dB. Furthermore, we offer insights into the generalization capabilities of Conv-TasNet and the potential value of improvements to the encoder/decoder.
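
The abstract reports its results as SI-SNR (scale-invariant signal-to-noise ratio). The following is a minimal sketch of that metric, assuming the commonly used definition: both signals are made zero-mean, the estimate is projected onto the target, and the residual is treated as noise. The function name `si_snr` and the `eps` parameter are illustrative choices and are not taken from the paper, which may compute the metric with different details.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sequences of floats.

    Illustrative sketch only; `eps` is an assumed stabilizer against division by zero.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target to get the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]

    # Whatever is left after removing the target component counts as noise.
    e_noise = [e - st for e, st in zip(est, s_target)]

    num = sum(x * x for x in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


if __name__ == "__main__":
    target = [1.0, -1.0, 1.0, -1.0]
    estimate = [1.0, -1.0, 1.0, -0.5]
    louder = [2.0 * x for x in estimate]
    # Rescaling the estimate should leave the score essentially unchanged.
    print(si_snr(estimate, target), si_snr(louder, target))
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a constant gain does not change the score. This is why the metric suits separation systems whose output level is arbitrary.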
