Paper Title
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Paper Authors
Paper Abstract
Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourcedness" is a complex problem that goes beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT remains centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets and MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.