Paper Title
Improving Multilingual Models with Language-Clustered Vocabularies
Paper Authors
Paper Abstract
State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model should expect to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. Our experiments show improvements across languages on key multilingual benchmark tasks TyDi QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER (+2.8 F1), as well as a factor-of-8 reduction in out-of-vocabulary rate, all without increasing the size of the model or data.
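To make the high-level procedure in the abstract concrete, the sketch below illustrates one plausible instantiation of the clustered-vocabulary idea: derive language clusters automatically, train a subword vocabulary per cluster, and take the union of the cluster vocabularies. The corpus paths, vocabulary sizes, cluster count, and the choice of k-means over binary subword-indicator vectors are all assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch of a language-clustered vocabulary pipeline.
# Assumptions: per-language corpora live in plain-text files, languages are
# clustered by overlap of small per-language "seed" vocabularies, and the
# final vocabulary is the union of one SentencePiece vocabulary per cluster.
import numpy as np
import sentencepiece as spm
from sklearn.cluster import KMeans

LANG_CORPORA = {"en": "en.txt", "de": "de.txt", "hi": "hi.txt", "ur": "ur.txt"}  # assumed paths
SEED_VOCAB_SIZE = 8000       # per-language seed vocabulary, used only for clustering
CLUSTER_VOCAB_SIZE = 32000   # vocabulary trained for each language cluster
NUM_CLUSTERS = 2             # illustrative; the paper derives clusters automatically

def train_vocab(files, prefix, vocab_size):
    """Train a SentencePiece model on the given text files and return its pieces."""
    spm.SentencePieceTrainer.Train(
        input=",".join(files), model_prefix=prefix, vocab_size=vocab_size)
    sp = spm.SentencePieceProcessor(model_file=f"{prefix}.model")
    return {sp.id_to_piece(i) for i in range(sp.get_piece_size())}

# 1) Train a small seed vocabulary per language and represent each language as a
#    binary indicator vector over the union of all seed subwords.
seed_vocabs = {lang: train_vocab([path], f"seed_{lang}", SEED_VOCAB_SIZE)
               for lang, path in LANG_CORPORA.items()}
all_pieces = sorted(set().union(*seed_vocabs.values()))
piece_index = {p: i for i, p in enumerate(all_pieces)}
langs = list(LANG_CORPORA)
vectors = np.zeros((len(langs), len(all_pieces)))
for row, lang in enumerate(langs):
    for piece in seed_vocabs[lang]:
        vectors[row, piece_index[piece]] = 1.0

# 2) Cluster languages by subword overlap (k-means is one plausible choice).
labels = KMeans(n_clusters=NUM_CLUSTERS, n_init=10, random_state=0).fit_predict(vectors)

# 3) Train one vocabulary per cluster on that cluster's corpora, then take the
#    union of the cluster vocabularies as the final multilingual vocabulary.
final_vocab = set()
for c in range(NUM_CLUSTERS):
    cluster_files = [LANG_CORPORA[l] for l, lab in zip(langs, labels) if lab == c]
    final_vocab |= train_vocab(cluster_files, f"cluster_{c}", CLUSTER_VOCAB_SIZE)

print(f"Final merged vocabulary size: {len(final_vocab)}")
```

The design intent captured here is the trade-off named in the abstract: related languages in the same cluster share subwords, while unrelated languages get cluster-specific vocabularies instead of competing for space in a single shared one.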