Paper Title

Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese

Authors

Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, Claudia Borg

Abstract

Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training set-ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks -- dependency parsing, part-of-speech tagging, and named-entity recognition -- and one semantic classification task -- sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than the corpora typically used for high-resource languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.
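As a hedged illustration of how such released checkpoints could be applied to one of the evaluated tasks (e.g. part-of-speech tagging), the minimal sketch below loads a model with the Hugging Face transformers library. The hub identifier "MLRS/BERTu" and the label count are assumptions for illustration, not details stated in this abstract; substitute the identifiers under which the checkpoints are actually published.

```python
# Minimal sketch (not the authors' evaluation code): loading a Maltese BERT
# checkpoint and preparing it for a token-classification task such as POS tagging.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "MLRS/BERTu"  # assumed hub id; the mBERT variant might be "MLRS/mBERTu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=17,  # e.g. the 17 Universal Dependencies UPOS tags
)

# Tokenise a Maltese sentence and run a forward pass; the argmax over logits
# gives per-subword label indices that would be mapped back to word-level tags.
inputs = tokenizer("Il-karozza l-ħamra waqfet quddiem id-dar.", return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
```

Fine-tuning on labelled data (rather than the untrained classification head shown here) would follow the standard transformers Trainer or a custom training loop.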
