Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Paper Title

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Authors

Nekoto, Wilhelmina; Marivate, Vukosi; Matsila, Tshinondiwa; Fasubaa, Timi; Kolawole, Tajudeen; Fagbohungbe, Taiwo; Akinola, Solomon Oluwole; Muhammad, Shamsuddeen Hassan; Kabongo, Salomon; Osei, Salomey; Freshia, Sackey; Niyongabo, Rubungo Andre; Macharm, Ricky; Ogayo, Perez; Ahia, Orevaoghene; Meressa, Musie; Adeyemi, Mofe; Mokgesi-Selinga, Masabata; Okegbemi, Lawrence; Martinus, Laura Jane; Tajudeen, Kolawole; Degila, Kevin; Ogueji, Kelechi; Siminyu, Kathleen; Kreutzer, Julia; Webster, Jason; Ali, Jamiil Toure; Abbott, Jade; Orife, Iroro; Ezeani, Ignatius; Dangana, Idris Abdulkabir; Kamper, Herman; Elsahar, Hady; Duru, Goodness; Kioko, Ghollah; Murhabazi, Espoir; van Biljon, Elan; Whitenack, Daniel; Onyefuluchi, Christopher; Emezue, Chris; Dossou, Bonaventure; Sibanda, Blessing; Bassey, Blessing Itoro; Olabiyi, Ayodele; Ramkilowan, Arshath; Öktem, Alp; Akinfaderin, Adewale; Bashir, Abdallah

Abstract

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem that goes beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT remains centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets and MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.
