Paper Title
Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing
Paper Authors
Paper Abstract
Consistency regularization has recently been applied to semi-supervised sequence-to-sequence (S2S) automatic speech recognition (ASR). This principle encourages an ASR model to output similar predictions for the same input speech with different perturbations. The existing paradigm of semi-supervised S2S ASR utilizes SpecAugment as data augmentation and requires a static teacher model to produce pseudo transcripts for untranscribed speech. However, this paradigm fails to take full advantage of consistency regularization. First, the masking operations of SpecAugment may damage the linguistic contents of the speech, thus influencing the quality of pseudo labels. Second, S2S ASR requires both input speech and prefix tokens to make the next prediction. The static prefix tokens made by the offline teacher model cannot match dynamic pseudo labels during consistency training. In this work, we propose an improved consistency training paradigm of semi-supervised S2S ASR. We utilize speech chain reconstruction as the weak augmentation to generate high-quality pseudo labels. Moreover, we demonstrate that dynamic pseudo transcripts produced by the student ASR model benefit the consistency training. Experiments on LJSpeech and LibriSpeech corpora show that compared to supervised baselines, our improved paradigm achieves a 12.2% CER improvement in the single-speaker setting and 38.6% in the multi-speaker setting.
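The core idea the abstract describes can be sketched in a few lines: pseudo labels are taken from the model's predictions on a weakly augmented view of the speech, and the predictions on a strongly augmented view are trained toward them. The sketch below is a minimal, framework-free illustration of that consistency loss; the function names and the use of per-frame logits are illustrative assumptions, not the paper's actual implementation (which operates on S2S token sequences with speech chain reconstruction as the weak augmentation).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(weak_logits, strong_logits):
    """Toy per-token consistency loss (illustrative, not the paper's code).

    weak_logits, strong_logits: arrays of shape (num_tokens, vocab_size)
    holding model predictions for the weakly / strongly augmented views
    of the same utterance. Dynamic pseudo labels are the argmax of the
    weak view; the strong view is trained toward them via cross-entropy.
    """
    pseudo = weak_logits.argmax(axis=-1)           # dynamic pseudo labels
    log_probs = np.log(softmax(strong_logits))     # strong-view log-probs
    idx = np.arange(len(pseudo))
    return -log_probs[idx, pseudo].mean()

# Toy usage: matching views yield a smaller loss than mismatched ones.
weak = np.array([[2.0, 0.0, 0.0],
                 [0.0, 3.0, 0.0]])
strong_match = weak.copy()
strong_mismatch = np.array([[0.0, 2.0, 0.0],
                            [3.0, 0.0, 0.0]])
print(consistency_loss(weak, strong_match) <
      consistency_loss(weak, strong_mismatch))
```

In the paper's setting the pseudo labels are regenerated by the student ASR model during training (rather than fixed by an offline teacher), so `pseudo` above would change as the model improves.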