Paper Title


How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

Paper Authors

Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych

Paper Abstract


In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
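The abstract's central claim is that a dedicated monolingual tokenizer segments a language's text into fewer, more word-like subword pieces than a shared multilingual vocabulary, and that this matters for downstream performance. The sketch below is not from the paper's released code; it simply illustrates the kind of comparison involved by computing subword "fertility" (average subword tokens per whitespace word) with a multilingual versus a monolingual Hugging Face tokenizer. The checkpoint names and the Turkish example sentence are illustrative choices, not ones specified by the abstract.

```python
# Minimal sketch: compare how finely a multilingual vs. a dedicated
# monolingual tokenizer segments the same text. Checkpoint names and the
# sample sentence are illustrative assumptions, not taken from the paper.
from transformers import AutoTokenizer


def fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    pieces = [tokenizer.tokenize(w) for w in words]
    return sum(len(p) for p in pieces) / len(words)


if __name__ == "__main__":
    sample_tr = "Kulaklarımda hâlâ çocukların neşeli kahkahaları yankılanıyordu."

    multi = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    mono = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

    # A higher fertility means the tokenizer splits words into more pieces,
    # i.e. the language is less well represented in that vocabulary.
    print("multilingual fertility:", round(fertility(multi, sample_tr), 2))
    print("monolingual fertility: ", round(fertility(mono, sample_tr), 2))
```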
