Paper Title

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Authors

Zhiwei He, Xing Wang, Rui Wang, Shuming Shi, Zhaopeng Tu

Abstract

Back-translation is a critical component of Unsupervised Neural Machine Translation (UNMT), which generates pseudo parallel data from target monolingual data. A UNMT model is trained on the pseudo parallel data with translated source, and translates natural source sentences in inference. The source discrepancy between training and inference hinders the translation performance of UNMT models. By carefully designing experiments, we identify two representative characteristics of the data gap in source: (1) style gap (i.e., translated vs. natural text style) that leads to poor generalization capability; (2) content gap that induces the model to produce hallucination content biased towards the target language. To narrow the data gap, we propose an online self-training approach, which simultaneously uses the pseudo parallel data {natural source, translated target} to mimic the inference scenario. Experimental results on several widely-used language pairs show that our approach outperforms two strong baselines (XLM and MASS) by remedying the style and content gaps.
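To make the data construction concrete, the following is a minimal Python sketch of how the two kinds of pseudo-parallel data described in the abstract could be assembled in one online training step. The `model.translate` and `model.train_step` calls are hypothetical placeholders for illustration only, not the authors' actual XLM/MASS-based implementation.

```python
# Minimal sketch, assuming a hypothetical model interface with
# `model.translate(sentences, direction)` and `model.train_step(batch)`.
# It only illustrates the data-construction logic from the abstract.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Batch:
    sources: List[str]
    targets: List[str]


def build_online_batches(model, src_mono: List[str], tgt_mono: List[str]) -> Tuple[Batch, Batch]:
    """Construct the two pseudo-parallel batches used in one update.

    Back-translation:            {translated source, natural target} from target monolingual data.
    Online self-training (paper): {natural source, translated target} from source monolingual data,
    which mimics the inference scenario (natural source as input).
    """
    # Back-translation: translate natural target sentences into the source language.
    translated_src = model.translate(tgt_mono, direction="tgt->src")
    bt_batch = Batch(sources=translated_src, targets=tgt_mono)

    # Online self-training: translate natural source sentences into the target language.
    translated_tgt = model.translate(src_mono, direction="src->tgt")
    st_batch = Batch(sources=src_mono, targets=translated_tgt)

    return bt_batch, st_batch


def training_step(model, src_mono: List[str], tgt_mono: List[str]) -> None:
    bt_batch, st_batch = build_online_batches(model, src_mono, tgt_mono)
    # Both batches are used simultaneously, so the model also sees natural-source
    # inputs during training, narrowing the style and content gaps at inference.
    model.train_step(bt_batch)
    model.train_step(st_batch)
```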
