Paper Title

MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation

Authors

Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

Abstract

End-to-end Speech-to-text Translation (E2E-ST), which directly translates source language speech to target language text, is widely useful in practice, but traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. On the other hand, existing end-to-end solutions heavily depend on source language transcriptions for pre-training or multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique to learn a robust speech encoder in a self-supervised fashion, using only the speech side, which can exploit speech data without transcription. This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution for improving E2E-ST, but can also perform pre-training on any acoustic signals (including non-speech ones) without annotation. We conduct experiments over 8 different translation directions. Without using any transcriptions, our technique achieves an average improvement of +1.1 BLEU, and +2.3 BLEU with MAM pre-training. Pre-training MAM on arbitrary acoustic signals also yields an average improvement of +1.6 BLEU for those languages. Compared with the ASR multi-task learning solution, which relies on transcription during training, our pre-trained MAM model, which does not use transcription, achieves similar accuracy.
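To make the self-supervised objective concrete, the following is a minimal PyTorch sketch of masked acoustic modeling: random spectrogram frames are replaced by a learned mask vector, and the encoder is trained to reconstruct the original frames at the masked positions. The layer sizes, masking ratio, learned mask embedding, and MSE reconstruction loss here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MaskedAcousticModel(nn.Module):
    """Toy MAM-style encoder: reconstruct masked spectrogram frames (illustrative only)."""

    def __init__(self, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reconstruct = nn.Linear(d_model, n_mels)          # predict original frames
        self.mask_embedding = nn.Parameter(torch.zeros(n_mels))  # learned "masked frame"

    def forward(self, spectrogram, mask_ratio=0.15):
        # spectrogram: (batch, time, n_mels) -- no transcription is needed
        batch, time, _ = spectrogram.shape
        mask = torch.rand(batch, time, device=spectrogram.device) < mask_ratio

        corrupted = spectrogram.clone()
        corrupted[mask] = self.mask_embedding   # corrupt the input at masked positions

        hidden = self.encoder(self.input_proj(corrupted))
        recon = self.reconstruct(hidden)

        # Self-supervised loss: reconstruct only the masked frames
        loss = nn.functional.mse_loss(recon[mask], spectrogram[mask])
        return loss, hidden


# Usage sketch: pre-train on unlabeled audio, then reuse `hidden` as the speech
# encoder states for the downstream E2E-ST model.
model = MaskedAcousticModel()
dummy_batch = torch.randn(2, 100, 80)   # (batch, time, n_mels)
loss, states = model(dummy_batch)
loss.backward()
```

In the paper's setting, this reconstruction loss can be used alone for pre-training on arbitrary acoustic signals, or added as an auxiliary objective alongside the translation loss; the sketch above covers only the masking-and-reconstruction step.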
