Paper Title

DropClass and DropAdapt: Dropping classes for deep speaker representation learning

Authors

Chau Luu, Peter Bell, Steve Renals

Abstract

Many recent works on deep speaker embeddings train their feature extraction networks on large classification tasks, distinguishing between all speakers in a training set. Empirically, this has been shown to produce speaker-discriminative embeddings, even for unseen speakers. However, it is not clear that this is the optimal means of training embeddings that generalize well. This work proposes two approaches to learning embeddings, based on the notion of dropping classes during training. We demonstrate that both approaches can yield performance gains in speaker verification tasks. The first proposed method, DropClass, works by periodically dropping a random subset of classes from the training data and the output layer throughout training, resulting in a feature extractor trained on many different classification tasks. Combined with an additive angular margin loss, this method yields a 7.9% relative improvement in equal error rate (EER) over a strong baseline on VoxCeleb. The second proposed method, DropAdapt, is a means of adapting a trained model to a set of enrolment speakers in an unsupervised manner. This is performed by fine-tuning the model on only those classes which produce high probability predictions when the enrolment speakers are used as input, again dropping the relevant rows from the output layer. This method yields a 13.2% relative improvement in EER on VoxCeleb. The code for this paper has been made publicly available.
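As a rough illustration of the DropClass procedure described in the abstract, the PyTorch sketch below drops a random subset of classes every few steps, masking both the mini-batch and the corresponding rows of the output layer. All names here (EMB_DIM, NUM_CLASSES, DROP_FRAC, PERIOD, the toy extractor, and batches()) are illustrative assumptions rather than the authors' released code, and the additive angular margin loss is replaced by plain cross-entropy for brevity.

```python
# Minimal DropClass sketch (assumptions: toy extractor, batches() data source,
# cross-entropy instead of the paper's additive angular margin loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, NUM_CLASSES = 512, 5000   # assumed sizes
DROP_FRAC, PERIOD = 0.5, 100       # drop half the classes every 100 steps

extractor = nn.Sequential(nn.Linear(40, EMB_DIM), nn.ReLU())   # stand-in for an x-vector net
head = nn.Parameter(torch.randn(NUM_CLASSES, EMB_DIM) * 0.01)  # full output layer
opt = torch.optim.SGD(list(extractor.parameters()) + [head], lr=0.01)

def sample_subtask():
    """Randomly choose which classes survive the current drop period."""
    keep = torch.randperm(NUM_CLASSES)[int(DROP_FRAC * NUM_CLASSES):].sort().values
    remap = {int(c): i for i, c in enumerate(keep)}  # original label -> reduced index
    return keep, remap

keep, remap = sample_subtask()
for step, (feats, labels) in enumerate(batches()):   # batches() is an assumed data source
    if step % PERIOD == 0:
        keep, remap = sample_subtask()               # periodically re-drop classes
    # Discard batch items whose class was dropped from the current sub-task.
    mask = torch.tensor([int(l) in remap for l in labels])
    if not mask.any():
        continue
    feats, labels = feats[mask], labels[mask]
    logits = extractor(feats) @ head[keep].t()       # reduced output layer
    target = torch.tensor([remap[int(l)] for l in labels])
    loss = F.cross_entropy(logits, target)
    opt.zero_grad(); loss.backward(); opt.step()
```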
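DropAdapt can be sketched in the same vein, continuing from the assumed setup above: average the trained model's softmax output over the unlabelled enrolment utterances, keep only the highest-probability training classes, and fine-tune using only the corresponding rows of the output layer. TOP_K, enrol_feats and train_batches() are again hypothetical placeholders, not the authors' interface.

```python
# Minimal DropAdapt sketch, reusing extractor, head, opt, F and torch from above.
TOP_K = 1000   # assumed number of classes to keep

with torch.no_grad():
    # enrol_feats: assumed tensor of features from the enrolment speakers.
    probs = F.softmax(extractor(enrol_feats) @ head.t(), dim=1)
    keep = probs.mean(dim=0).topk(TOP_K).indices.sort().values  # most-probable classes
remap = {int(c): i for i, c in enumerate(keep)}

for feats, labels in train_batches():                # assumed original training data
    mask = torch.tensor([int(l) in remap for l in labels])
    if not mask.any():
        continue
    logits = extractor(feats[mask]) @ head[keep].t() # rows of dropped classes removed
    target = torch.tensor([remap[int(l)] for l in labels[mask]])
    loss = F.cross_entropy(logits, target)
    opt.zero_grad(); loss.backward(); opt.step()
```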
