Synthetic Source Language Augmentation for Colloquial Neural Machine Translation

Paper Title

Synthetic Source Language Augmentation for Colloquial Neural Machine Translation

Authors

Asrul Sani Ariesandy, Mukhlis Amien, Alham Fikri Aji, Radityo Eko Prasojo

Abstract

Neural machine translation (NMT) is typically domain-dependent and style-dependent, and it requires large amounts of training data. State-of-the-art NMT models often fall short in handling colloquial variations of their source language, and the lack of parallel data in this regard is a challenging hurdle in systematically improving the existing models. In this work, we develop a novel colloquial Indonesian-English test set collected from YouTube transcripts and Twitter. We perform synthetic style augmentation on the formal Indonesian source language and show that it improves the baseline Id-En models (in BLEU) on the new test data.
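As a rough illustration of what synthetic style augmentation of a formal source side could look like, the sketch below rewrites formal Indonesian tokens into colloquial variants and adds the rewritten sentences to the training pairs while leaving the English targets untouched. It is a minimal sketch under stated assumptions, not the authors' method: the lexicon, the function names `colloquialize` and `augment`, and the replacement probability `p` are all hypothetical and chosen only for this example.

```python
import random

# Illustrative formal -> colloquial lexicon. This is an assumption made for
# the example, not a resource described in the abstract.
COLLOQUIAL_LEXICON = {
    "tidak": "nggak",
    "sudah": "udah",
    "saya": "gue",
    "kamu": "lo",
}


def colloquialize(sentence, rng, lexicon=COLLOQUIAL_LEXICON, p=1.0):
    """Replace each whitespace-separated token found in the lexicon with probability p."""
    out = []
    for token in sentence.split():
        replacement = lexicon.get(token.lower())
        if replacement is not None and rng.random() < p:
            out.append(replacement)
        else:
            out.append(token)
    return " ".join(out)


def augment(pairs, p=1.0, seed=0):
    """Return the original (source, target) pairs plus synthetic colloquial-source copies."""
    rng = random.Random(seed)
    synthetic = [(colloquialize(src, rng, p=p), tgt) for src, tgt in pairs]
    # Skip synthetic pairs that are identical to an existing pair.
    return pairs + [pair for pair in synthetic if pair not in pairs]


formal_pairs = [("saya tidak tahu", "I do not know")]
augmented = augment(formal_pairs)
for src, tgt in augmented:
    print(f"{src}\t{tgt}")
```

Only the source side is changed, so each synthetic pair keeps a correct English reference. A system trained this way would then be scored with BLEU on the colloquial test set.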
