Paper Title


Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning

Paper Authors

Wei Xia, Chunlei Zhang, Chao Weng, Meng Yu, Dong Yu

Paper Abstract

In this study, we investigate self-supervised representation learning for speaker verification (SV). First, we examine a simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples. We show that better speaker embeddings can be learned by momentum contrastive learning. Next, alternative augmentation strategies are explored to normalize extrinsic speaker variabilities of two random segments from the same speech utterance. Specifically, augmentation in the waveform largely improves the speaker representations for SV tasks. The proposed MoCo speaker embedding is further improved when a prototypical memory bank is introduced, which encourages the speaker embeddings to be closer to their assigned prototypes with an intermediate clustering step. In addition, we generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled. Comprehensive experiments on the Voxceleb dataset demonstrate that our proposed self-supervised approach achieves competitive performance compared with existing techniques, and can approach fully supervised results with partially labeled data.
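The MoCo-style objective described in the abstract — scoring an anchor embedding against one positive (a segment from the same utterance) and a queue of negatives — can be sketched as follows. This is an illustrative NumPy formulation of the standard InfoNCE loss with a negative queue, not the authors' implementation; the function name and temperature value are assumptions.

```python
import numpy as np

def info_nce_with_queue(q, k_pos, queue, temperature=0.07):
    """InfoNCE loss for one anchor embedding `q`, its positive `k_pos`,
    and a MoCo-style queue of negative embeddings (shape [K, dim]).
    Illustrative sketch; temperature 0.07 is a common default, not
    necessarily the paper's setting."""
    # L2-normalize so dot products are cosine similarities.
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)

    l_pos = q @ k_pos          # similarity to the positive segment
    l_neg = queue @ q          # similarities to the queued negatives
    logits = np.concatenate([[l_pos], l_neg]) / temperature

    # Cross-entropy with the positive at index 0.
    logits -= logits.max()     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

In MoCo, the key (queue-side) encoder is not trained by backpropagation but tracks the query encoder by an exponential moving average of its parameters, which keeps the queued negatives consistent across iterations.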
