Within-sample variability-invariant loss for robust speaker recognition under noisy environments

Paper title

Within-sample variability-invariant loss for robust speaker recognition under noisy environments

Authors

Danwei Cai, Weicheng Cai, Ming Li

Abstract

Despite the significant improvements in speaker recognition enabled by deep neural networks, performance remains unsatisfactory under noisy environments. In this paper, we train the speaker embedding network to learn the "clean" embedding of the noisy utterance. Specifically, the network is trained with the original speaker identification loss together with an auxiliary within-sample variability-invariant loss. This auxiliary loss drives the network to produce the same embedding for a clean utterance and its noisy copies, and prevents it from encoding undesired noise or variability into the speaker representation. Furthermore, we investigate a data preparation strategy that generates clean and noisy utterance pairs on-the-fly. The strategy generates different noisy copies of the same clean utterance at each training step, helping the speaker embedding network generalize better under noisy environments. Experiments on VoxCeleb1 indicate that the proposed training framework improves the performance of the speaker verification system in both clean and noisy conditions.
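
The training objective described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the exact form of the variability-invariant loss, so mean squared error between the clean and noisy embeddings is assumed here. The weight `alpha`, the SNR range, the additive noise mixing, and all function names are hypothetical. Plain Python lists stand in for waveforms, embeddings, and logits so that the sketch runs without any deep learning library.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR in dB.

    Assumes both signals have the same length and the noise is not silent.
    """
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Pick gain g so that p_clean / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(clean, noise)]


def make_pair(clean, noise_bank, rng, snr_range=(0.0, 20.0)):
    """Draw a fresh noisy copy of `clean`; meant to be called at every training step."""
    noise = rng.choice(noise_bank)
    snr_db = rng.uniform(*snr_range)  # hypothetical SNR range
    return clean, mix_at_snr(clean, noise, snr_db)


def cross_entropy(logits, label):
    """Speaker identification loss for one utterance (softmax cross-entropy)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def invariance_loss(emb_clean, emb_noisy):
    """Assumed form of the auxiliary loss: MSE between clean and noisy embeddings."""
    return sum((a - b) ** 2 for a, b in zip(emb_clean, emb_noisy)) / len(emb_clean)


def total_loss(logits_noisy, label, emb_clean, emb_noisy, alpha=1.0):
    """Identification loss plus the weighted invariance term (alpha is hypothetical)."""
    return cross_entropy(logits_noisy, label) + alpha * invariance_loss(emb_clean, emb_noisy)
```

In a real training loop, `make_pair` would be called inside the data loader so that each step sees a different noisy copy of the same clean utterance, and both waveforms would pass through the same embedding network before `total_loss` is computed. When the two embeddings coincide, the auxiliary term vanishes and only the identification loss remains.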
