Analysis of Deep Feature Loss based Enhancement for Speaker Verification

Title

Analysis of Deep Feature Loss based Enhancement for Speaker Verification

Authors

Saurabh Kataria, Phani Sankar Nidadavolu, Jesús Villalba, Najim Dehak

Abstract

Data augmentation is conventionally used to inject robustness into speaker verification systems. Several recently organized challenges focus on handling novel acoustic environments, and deep learning based speech enhancement is a modern solution for this. Recently, a study proposed optimizing the enhancement network in the activation space of a pre-trained auxiliary network. This methodology, called deep feature loss, greatly improved over the state-of-the-art conventional x-vector based system on a children's speech dataset called BabyTrain. This work analyzes various facets of that approach and asks a few novel questions in that context. We first search for the optimal number of auxiliary network activations, training data, and enhancement feature dimension. Experiments reveal the importance of the Signal-to-Noise Ratio (SNR) filtering that we employ to create a large, clean, and naturalistic corpus for enhancement network training. To counter the "mismatch" problem in enhancement, we find that enhancing the front-end (x-vector network) data is helpful, while enhancing the back-end (Probabilistic Linear Discriminant Analysis (PLDA)) data is harmful. Importantly, we find that enhanced signals contain information complementary to the original signals: combining the two in the front-end gives ~40% relative improvement over the baseline. We also do an ablation study that removes a noise class from the x-vector data augmentation and, for such systems, we establish the utility of enhancement regardless of whether the enhancement network has seen that noise class itself during training. Finally, we design several dereverberation schemes and conclude that the deep feature loss enhancement scheme is ineffective for this task.
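
As an illustration of what "optimizing in the activation space of a pre-trained auxiliary network" can mean, the sketch below compares enhanced and clean features layer by layer inside a frozen network instead of comparing the features directly. It is a minimal toy, not the authors' implementation: the abstract does not state the distance measure or the architecture, so the mean absolute difference, the ReLU layers, the random weights, and every name used here (`deep_feature_loss`, `make_toy_auxiliary_network`, `num_layers`) are assumptions made for illustration only.

```python
import random


def relu(vector):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in vector]


def linear(weights, vector):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def auxiliary_activations(layers, features):
    """Pass features through the frozen auxiliary network and
    collect the activation of every hidden layer."""
    activations = []
    hidden = features
    for weights in layers:
        hidden = relu(linear(weights, hidden))
        activations.append(hidden)
    return activations


def deep_feature_loss(layers, enhanced, clean, num_layers=None):
    """Sum, over the first `num_layers` auxiliary layers, of the mean
    absolute difference between activations of the enhanced and the
    clean features. The L1 distance is an assumption of this sketch."""
    enhanced_acts = auxiliary_activations(layers, enhanced)
    clean_acts = auxiliary_activations(layers, clean)
    if num_layers is not None:
        enhanced_acts = enhanced_acts[:num_layers]
        clean_acts = clean_acts[:num_layers]
    total = 0.0
    for enh, cln in zip(enhanced_acts, clean_acts):
        total += sum(abs(a - b) for a, b in zip(enh, cln)) / len(enh)
    return total


def make_toy_auxiliary_network(dims, seed=0):
    """Hypothetical stand-in for a pre-trained network: fixed random
    weights for consecutive layer sizes in `dims`."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


if __name__ == "__main__":
    aux = make_toy_auxiliary_network([4, 8, 8, 3])
    clean = [0.2, -0.1, 0.5, 0.3]
    enhanced = [0.25, -0.05, 0.4, 0.35]  # imperfect enhancement output
    print(deep_feature_loss(aux, enhanced, clean))
    print(deep_feature_loss(aux, enhanced, clean, num_layers=1))
```

The `num_layers` argument mirrors the abstract's question about how many auxiliary network activations to use: because each layer contributes a non-negative term, restricting the loss to fewer layers can only keep it the same or make it smaller. In an actual training setup the loss would be backpropagated through the frozen auxiliary network into the enhancement network, which this toy does not attempt.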
