Revisiting Low Resource Status of Indian Languages in Machine Translation

Paper Title

Revisiting Low Resource Status of Indian Languages in Machine Translation

Authors

Jerin Philip, Shashank Siripragada, Vinay P. Namboodiri, C. V. Jawahar

Abstract

Indian language machine translation performance is hampered by the lack of large-scale multilingual sentence-aligned corpora and robust benchmarks. In this paper, we provide and analyse an automated framework to obtain such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module, and it works with publicly available websites such as government press releases. The main contribution of this effort is an incremental method that uses the above pipeline to iteratively increase the size of the corpus and improve each component of our system. We also evaluate design choices such as the choice of pivoting language and the effect of iteratively increasing the corpus size. Besides providing an automated framework, our work yields a relatively large corpus compared to the existing corpora available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.
