Paper Title


Ask2Mask: Guided Data Selection for Masked Speech Modeling

Authors

Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Pedro Moreno

Abstract


Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve the performance of Automatic Speech Recognition (ASR) systems, they have one major limitation: they treat all unsupervised speech samples with equal weight, which hinders learning, since not all samples carry information relevant to learning meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training. ATM employs an external ASR model, or \textit{scorer}, to weight unsupervised input samples in two different ways: 1) A fine-grained data selection is performed by masking over the highly confident input frames as chosen by the scorer. This allows the model to learn meaningful representations. 2) ATM is further extended to focus at the utterance level by weighting the final MSM loss with the utterance-level confidence score. We conduct fine-tuning experiments on two well-benchmarked corpora: LibriSpeech (matching the pre-training data) and Commonvoice, TED-LIUM, AMI and CHiME-6 (not matching the pre-training data). The results substantiate the efficacy of ATM in significantly improving recognition performance under mismatched conditions (up to 11.6\% relative over published results and up to 4.46\% relative over our internal baseline) while still yielding modest improvements under matched conditions.
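The two weighting mechanisms described in the abstract can be sketched in a few lines. The following is an illustrative toy implementation, not the paper's actual code: the function name `atm_mask_and_weight`, the `mask_fraction` parameter, and the use of raw per-frame confidences are assumptions made for the example. It shows fine-grained selection (masking the most confident frames, as judged by an external scorer) and utterance-level weighting (scaling the MSM loss by the mean confidence).

```python
import numpy as np

def atm_mask_and_weight(frame_confidences, mask_fraction=0.4):
    """Toy sketch of ATM-style guided masking (hypothetical interface).

    frame_confidences: per-frame confidence scores from an external
    ASR "scorer" model, shape (T,).
    Returns a boolean mask over frames (True = masked) and an
    utterance-level weight for the MSM loss.
    """
    t = len(frame_confidences)
    n_mask = max(1, int(mask_fraction * t))
    # Fine-grained selection: mask the frames the scorer is most
    # confident about, so the model reconstructs reliable content.
    order = np.argsort(frame_confidences)[::-1]  # descending confidence
    mask = np.zeros(t, dtype=bool)
    mask[order[:n_mask]] = True
    # Utterance-level focus: weight this utterance's MSM loss by its
    # mean scorer confidence.
    utt_weight = float(np.mean(frame_confidences))
    return mask, utt_weight
```

In a pre-training loop, `mask` would replace the random mask used by wav2vec2/w2v-BERT, and the utterance's contrastive or prediction loss would be multiplied by `utt_weight` before back-propagation.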
