Paper Title
Unsupervised Parallel Corpus Mining on Web Data
Paper Authors
Abstract
With a large amount of parallel data, neural machine translation systems are able to deliver human-level performance for sentence-level translation. However, it is costly to have humans label large amounts of parallel data. In contrast, large-scale parallel corpora created by humans already exist on the Internet. The major difficulty in utilizing them is filtering them out of noisy website environments. Current parallel data mining methods all require labeled parallel data as the training source. In this paper, we present a pipeline to mine parallel corpora from the Internet in an unsupervised manner. On the widely used WMT'14 English-French and WMT'16 English-German benchmarks, a machine translator trained with the data extracted by our pipeline achieves performance very close to that of supervised systems. On the WMT'16 English-Romanian and Romanian-English benchmarks, our system produces new state-of-the-art results of 39.81 and 38.95 BLEU, even when compared with supervised approaches.