Paper title

An empirical study of Conv-TasNet

Authors

Berkan Kadioglu, Michael Horgan, Xiaoyu Liu, Jordi Pons, Dan Darcy, Vivek Kumar

Abstract

Conv-TasNet is a recently proposed waveform-based deep neural network that achieves state-of-the-art performance in speech source separation. Its architecture consists of a learnable encoder/decoder and a separator that operates on top of this learned space. Various improvements have been proposed to Conv-TasNet. However, they mostly focus on the separator, leaving its encoder/decoder as a (shallow) linear operator. In this paper, we conduct an empirical study of Conv-TasNet and propose an enhancement to the encoder/decoder that is based on a (deep) non-linear variant of it. In addition, we experiment with the larger and more diverse LibriTTS dataset and investigate the generalization capabilities of the studied models when trained on a much larger dataset. We propose cross-dataset evaluation that includes assessing separations from the WSJ0-2mix, LibriTTS and VCTK databases. Our results show that enhancements to the encoder/decoder can improve average SI-SNR performance by more than 1 dB. Furthermore, we offer insights into the generalization capabilities of Conv-TasNet and the potential value of improvements to the encoder/decoder.
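
The reported gains are measured in SI-SNR (scale-invariant signal-to-noise ratio). The sketch below is a minimal pure-Python illustration of that metric. It uses the common zero-mean definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. It is not taken from the paper's code. The function name `si_snr` and the `eps` stabilizer are our own assumptions.

```python
import math

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR (dB) between two equal-length 1-D signals.

    Illustrative sketch only; `eps` is an assumed small constant that
    avoids division by zero and log of zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean of each signal so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Target component: projection of the estimate onto the reference,
    # s_target = (<e, r> / ||r||^2) * r. This makes the metric scale-invariant.
    dot = sum(a * b for a, b in zip(e, r))
    energy_r = sum(b * b for b in r) + eps
    s_target = [dot / energy_r * b for b in r]
    # Everything in the estimate not explained by the target counts as error.
    e_noise = [a - t for a, t in zip(e, s_target)]
    num = sum(t * t for t in s_target)
    den = sum(x * x for x in e_noise)
    return 10 * math.log10((num + eps) / (den + eps))

# Usage: a reference plus a zero-mean noise orthogonal to it.
ref = [1.0, -1.0, 1.0, -1.0]
noise = [1.0, 1.0, -1.0, -1.0]
est = [r + 0.5 * v for r, v in zip(ref, noise)]
print(si_snr(est, ref))                    # about 6.02 dB (target 4, noise 1)
print(si_snr([3 * x for x in est], ref))   # same value: rescaling does not change it
```

Under this definition, an improvement of 1 dB means the ratio of target energy to residual-error energy grows by a factor of about 1.26.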
