Paper title
Sample, Translate, Recombine: Leveraging Audio Alignments for Data Augmentation in End-to-end Speech Translation
Paper authors
Abstract
End-to-end speech translation relies on data that pair source-language speech inputs with corresponding translations into a target language. Such data are notoriously scarce, making synthetic data augmentation by back-translation or knowledge distillation a necessary ingredient of end-to-end training. In this paper, we present a novel approach to data augmentation that leverages audio alignments, linguistic properties, and translation. First, we augment a transcription by sampling from a suffix memory that stores text and audio data. Second, we translate the augmented transcript. Finally, we recombine concatenated audio segments and the generated translation. Apart from training an MT system, we use only basic off-the-shelf components without fine-tuning. While having resource demands similar to knowledge distillation, adding our method delivers consistent improvements of up to 0.9 and 1.1 BLEU points on five language pairs on CoVoST 2 and on two language pairs on Europarl-ST, respectively.
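The three steps in the abstract (sample, translate, recombine) can be sketched as a minimal pipeline. This is an illustrative assumption, not the paper's implementation: audio is represented by opaque word-level segment IDs, `translate` is a stub standing in for the trained MT system, and the suffix is drawn uniformly at random, whereas the actual method constrains sampling with audio alignments and linguistic properties.

```python
import random

def build_suffix_memory(corpus):
    """Index (text suffix, audio segments) pairs from aligned utterances.

    corpus: list of (words, segments) with one audio segment ID per word.
    Every proper suffix of each utterance is stored in the memory.
    """
    memory = []
    for words, segments in corpus:
        for i in range(1, len(words)):
            memory.append((words[i:], segments[i:]))
    return memory

def translate(words):
    """Stub MT system: uppercases tokens instead of really translating."""
    return [w.upper() for w in words]

def augment(prefix_words, prefix_segments, memory, rng):
    """Sample a suffix, translate the augmented transcript, recombine audio."""
    suffix_words, suffix_segments = rng.choice(memory)
    words = prefix_words + suffix_words            # 1. sample: augmented transcript
    translation = translate(words)                 # 2. translate the new transcript
    audio = prefix_segments + suffix_segments      # 3. recombine concatenated audio
    return audio, translation

# Toy usage: two aligned utterances, then one synthetic (audio, translation) pair.
corpus = [(["we", "go", "home"], ["a1", "a2", "a3"]),
          (["they", "eat", "now"], ["b1", "b2", "b3"])]
memory = build_suffix_memory(corpus)
audio, translation = augment(["we"], ["a1"], memory, random.Random(0))
```

The output pair couples a concatenation of real audio segments with a machine-generated translation, which is what makes the synthetic examples usable as additional end-to-end training data.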