Paper Title

MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation

Authors

Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

Abstract

End-to-end Speech-to-text Translation (E2E-ST), which directly translates source language speech to target language text, is widely useful in practice, but traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. On the other hand, existing end-to-end solutions heavily depend on source language transcriptions for pre-training or multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique to learn a robust speech encoder in a self-supervised fashion, using only the speech side, which can exploit speech data without transcription. This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution for improving E2E-ST, but can also perform pre-training on any acoustic signals (including non-speech ones) without annotation. We conduct experiments over 8 different translation directions. Without using any transcriptions, our technique achieves an average improvement of +1.1 BLEU, and +2.3 BLEU with MAM pre-training. Pre-training MAM on arbitrary acoustic signals also yields an average improvement of +1.6 BLEU for those languages. Compared with the ASR multi-task learning solution, which relies on transcription during training, our pre-trained MAM model, which does not use transcription, achieves similar accuracy.
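To make the self-supervised objective concrete, the following is a minimal PyTorch sketch of masked acoustic modeling: random spectrogram frames are replaced by a learned mask vector, and the encoder is trained to reconstruct the original frames at the masked positions. The layer sizes, masking ratio, learned mask embedding, and MSE reconstruction loss here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MaskedAcousticModel(nn.Module):
    """Toy MAM-style encoder: reconstruct masked spectrogram frames (illustrative only)."""

    def __init__(self, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reconstruct = nn.Linear(d_model, n_mels)          # predict original frames
        self.mask_embedding = nn.Parameter(torch.zeros(n_mels))  # learned "masked frame"

    def forward(self, spectrogram, mask_ratio=0.15):
        # spectrogram: (batch, time, n_mels) -- no transcription is needed
        batch, time, _ = spectrogram.shape
        mask = torch.rand(batch, time, device=spectrogram.device) < mask_ratio

        corrupted = spectrogram.clone()
        corrupted[mask] = self.mask_embedding   # corrupt the input at masked positions

        hidden = self.encoder(self.input_proj(corrupted))
        recon = self.reconstruct(hidden)

        # Self-supervised loss: reconstruct only the masked frames
        loss = nn.functional.mse_loss(recon[mask], spectrogram[mask])
        return loss, hidden


# Usage sketch: pre-train on unlabeled audio, then reuse `hidden` as the speech
# encoder states for the downstream E2E-ST model.
model = MaskedAcousticModel()
dummy_batch = torch.randn(2, 100, 80)   # (batch, time, n_mels)
loss, states = model(dummy_batch)
loss.backward()
```

In the paper's setting, this reconstruction loss can be used alone for pre-training on arbitrary acoustic signals, or added as an auxiliary objective alongside the translation loss; the sketch above covers only the masking-and-reconstruction step.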
