Paper Title
Improving Neural Machine Translation of Indigenous Languages with Multilingual Transfer Learning
Paper Authors
Paper Abstract
Machine translation (MT) involving Indigenous languages, including those possibly endangered, is challenging due to lack of sufficient parallel data. We describe an approach exploiting bilingual and multilingual pretrained MT models in a transfer learning setting to translate from Spanish to ten South American Indigenous languages. Our models set new SOTA on five out of the ten language pairs we consider, even doubling performance on one of these five pairs. Unlike previous SOTA that perform data augmentation to enlarge the train sets, we retain the low-resource setting to test the effectiveness of our models under such a constraint. In spite of the rarity of linguistic information available about the Indigenous languages, we offer a number of quantitative and qualitative analyses (e.g., as to morphology, tokenization, and orthography) to contextualize our results.