Paper Title
Adapting self-supervised models to multi-talker speech recognition using speaker embeddings
Paper Authors
Paper Abstract
Self-supervised learning (SSL) methods, which learn representations of data without explicit supervision, have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often show degraded performance in multi-talker scenarios (possibly due to domain mismatch), which severely limits their use for such applications. In this paper, we investigate the adaptation of upstream SSL models to the multi-talker automatic speech recognition (ASR) task under two conditions. First, when segmented utterances are given, we show that adding a target speaker extraction (TSE) module based on enrollment embeddings is complementary to mixture-aware pre-training. Second, for unsegmented mixtures, we propose a novel joint speaker modeling (JSM) approach, which aggregates information from all speakers in the mixture through their embeddings. With controlled experiments on Libri2Mix, we show that using speaker embeddings provides relative word error rate (WER) improvements of 9.1% and 42.1% over strong baselines for the segmented and unsegmented cases, respectively. We also demonstrate the effectiveness of our models for real conversational mixtures through experiments on the AMI dataset.
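To make the two conditioning schemes concrete, below is a minimal PyTorch-style sketch, not the paper's implementation. The module names (`TargetSpeakerExtraction`, `JointSpeakerModeling`), the feature dimensions (768 for SSL frame features, 192 for speaker embeddings), and the concatenation-plus-MLP fusion and mean-pooling aggregation are all illustrative assumptions; the abstract only specifies that enrollment embeddings condition extraction in the segmented case and that all speakers' embeddings are aggregated in the unsegmented case.

```python
# Minimal sketch (assumptions noted above): conditioning upstream SSL frame
# features on speaker embeddings, in the spirit of the TSE and JSM ideas.
import torch
import torch.nn as nn


class TargetSpeakerExtraction(nn.Module):
    """Fuse an enrollment speaker embedding into SSL frame features so a
    downstream ASR head can focus on the target talker (segmented case)."""

    def __init__(self, feat_dim: int = 768, spk_dim: int = 192):
        super().__init__()
        self.proj = nn.Linear(spk_dim, feat_dim)
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, ssl_feats: torch.Tensor, enroll_emb: torch.Tensor):
        # ssl_feats: (batch, time, feat_dim); enroll_emb: (batch, spk_dim)
        spk = self.proj(enroll_emb).unsqueeze(1).expand_as(ssl_feats)
        return self.fuse(torch.cat([ssl_feats, spk], dim=-1))


class JointSpeakerModeling(nn.Module):
    """Aggregate embeddings of all speakers in an unsegmented mixture into a
    single conditioning vector (mean pooling here, one simple choice)."""

    def __init__(self, feat_dim: int = 768, spk_dim: int = 192):
        super().__init__()
        self.tse = TargetSpeakerExtraction(feat_dim, spk_dim)

    def forward(self, ssl_feats: torch.Tensor, all_spk_embs: torch.Tensor):
        # all_spk_embs: (batch, n_speakers, spk_dim)
        joint = all_spk_embs.mean(dim=1)
        return self.tse(ssl_feats, joint)


# Toy usage: 2 mixtures, 50 frames of 768-dim SSL features, 2 speakers each.
feats = torch.randn(2, 50, 768)
embs = torch.randn(2, 2, 192)
out = JointSpeakerModeling()(feats, embs)
print(out.shape)  # torch.Size([2, 50, 768])
```

In practice the conditioned features would feed an ASR decoder; the fusion point, the aggregation function, and whether the upstream SSL encoder is frozen or fine-tuned are design choices the abstract does not pin down.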