Paper Title

Crossed-Time Delay Neural Network for Speaker Recognition

Authors

Chen, Liang, Liang, Yanchun, Shi, Xiaohu, Zhou, You, Wu, Chunguo

Abstract

Time Delay Neural Network (TDNN) is a well-performing structure for DNN-based speaker recognition systems. In this paper we introduce a novel structure, the Crossed-Time Delay Neural Network (CTDNN), to enhance the performance of the current TDNN. Inspired by the multi-filter setting of convolutional layers in convolutional neural networks, we place multiple time delay units, each with a different context size, at the bottom layer, constructing a multilayer parallel network. The proposed CTDNN gives significant improvements over the original TDNN on both speaker verification and identification tasks. In the verification experiment on the VoxCeleb1 dataset, it achieves a 2.6% absolute Equal Error Rate improvement. Under the few-shot condition, CTDNN reaches 90.4% identification accuracy, doubling that of the original TDNN. We also compare the proposed CTDNN with another new variant of TDNN, FTDNN; our model shows a 36% absolute identification accuracy improvement under the few-shot condition and can handle larger training batches in a shorter training time, making better use of computational resources. The code of the new model is released at https://github.com/chenllliang/CTDNN
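The core idea of the abstract — parallel time-delay units with different context sizes at the bottom layer, analogous to multiple filters in a convolutional layer — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that); the function names, random untrained weights, and the choice to trim all unit outputs to a common length before concatenation are illustrative assumptions.

```python
import numpy as np

def time_delay_unit(frames, context_size, out_dim, rng):
    """One time-delay unit: slide a window of `context_size` frames over the
    input and project each flattened window through a weight matrix.
    Weights are random here (untrained sketch, not the paper's model)."""
    T, d = frames.shape
    W = rng.standard_normal((context_size * d, out_dim)) * 0.01
    windows = np.stack([frames[t:t + context_size].reshape(-1)
                        for t in range(T - context_size + 1)])
    return np.maximum(windows @ W, 0.0)  # ReLU activation

def crossed_bottom_layer(frames, context_sizes, out_dim, seed=0):
    """Parallel time-delay units, each with a different context size (the
    multi-filter idea); outputs are trimmed to the shortest unit's length
    and concatenated along the feature axis."""
    rng = np.random.default_rng(seed)
    outs = [time_delay_unit(frames, c, out_dim, rng) for c in context_sizes]
    min_len = min(o.shape[0] for o in outs)
    return np.concatenate([o[:min_len] for o in outs], axis=1)

# 100 frames of 24-dimensional acoustic features (synthetic stand-in)
feats = np.random.default_rng(1).standard_normal((100, 24))
out = crossed_bottom_layer(feats, context_sizes=[3, 5, 7], out_dim=64)
print(out.shape)  # (94, 192): 100-7+1 frames, 3 units x 64 dims each
```

Each unit sees the same frames through a different temporal receptive field, so the concatenated output carries multi-scale context that a single fixed-context TDNN layer would not.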
