Paper Title

How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?

Paper Authors

Shiyue Zhang, Vishrav Chaudhary, Naman Goyal, James Cross, Guillaume Wenzek, Mohit Bansal, Francisco Guzman

Paper Abstract

A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained from a multilingual corpus. Since a skewed data distribution is considered to be harmful, a sampling strategy is usually used to balance languages in the corpus. However, few works have systematically answered how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected. Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before performing the task. We also distinguish language sampling for tokenizer training from sampling for model training and show that the model is more sensitive to the latter.
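
The abstract notes that a sampling strategy is usually used to balance languages in the tokenizer training corpus. A common instance of this in the multilingual NMT literature is temperature-based sampling, where a language's sampling probability is proportional to its corpus share raised to the power 1/T. The sketch below illustrates that idea; the function name, example corpus sizes, and temperature value are illustrative assumptions, not the paper's exact setup.

```python
def temperature_sampling_probs(line_counts, T=5.0):
    """Per-language sampling probabilities p_l proportional to q_l**(1/T),
    where q_l is language l's share of the corpus. T=1 keeps the natural
    (skewed) distribution; larger T moves the distribution toward uniform."""
    total = sum(line_counts.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in line_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}


# Illustrative corpus sizes (sentences per language); the numbers are made up.
counts = {"en": 10_000_000, "fr": 1_000_000, "gu": 50_000}
print(temperature_sampling_probs(counts, T=5.0))
# The low-resource language ("gu") is upsampled relative to its raw share.
```

Note that this same scheme can be applied independently at two stages, which is the distinction the abstract draws: once when assembling the tokenizer training corpus, and again when sampling batches for model training.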
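
The two warning features named in the abstract, UNK rate and closeness to the character level, can be estimated from a trained tokenizer alone, before running the downstream task. Below is a minimal sketch assuming a SentencePiece tokenizer; the proxy definitions used here (UNK rate as the fraction of tokens mapped to the unknown id, closeness to character level as mean characters per token) are our assumptions and may differ from the paper's exact metrics.

```python
import sentencepiece as spm


def tokenizer_warning_features(model_path, sentences):
    """Return (unk_rate, chars_per_token) over a held-out sentence list.
    A chars_per_token value near 1.0 means the tokenizer segments the text
    nearly character by character, the degenerate case the abstract warns about."""
    sp = spm.SentencePieceProcessor(model_file=model_path)
    unk_id = sp.unk_id()
    n_tokens, n_unk, n_chars = 0, 0, 0
    for sent in sentences:
        ids = sp.encode(sent, out_type=int)
        n_tokens += len(ids)
        n_unk += sum(1 for i in ids if i == unk_id)
        n_chars += len(sent.replace(" ", ""))
    return n_unk / max(n_tokens, 1), n_chars / max(n_tokens, 1)


# Usage with a hypothetical model file and held-out sentences:
# unk_rate, cpt = tokenizer_warning_features(
#     "multilingual.model", ["Hello world.", "Bonjour le monde."])
# print(f"UNK rate: {unk_rate:.4f}, chars/token: {cpt:.2f}")
```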
