Paper Title

Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese

Authors

Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, Claudia Borg

Abstract

Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training set-ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks -- dependency parsing, part-of-speech tagging, and named-entity recognition -- and one semantic classification task -- sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than the corpora typically used for high-resource languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.
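As a hedged illustration of how such released checkpoints could be applied to one of the evaluated tasks (e.g. part-of-speech tagging), the minimal sketch below loads a model with the Hugging Face transformers library. The hub identifier "MLRS/BERTu" and the label count are assumptions for illustration, not details stated in this abstract; substitute the identifiers under which the checkpoints are actually published.

```python
# Minimal sketch (not the authors' evaluation code): loading a Maltese BERT
# checkpoint and preparing it for a token-classification task such as POS tagging.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "MLRS/BERTu"  # assumed hub id; the mBERT variant might be "MLRS/mBERTu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=17,  # e.g. the 17 Universal Dependencies UPOS tags
)

# Tokenise a Maltese sentence and run a forward pass; the argmax over logits
# gives per-subword label indices that would be mapped back to word-level tags.
inputs = tokenizer("Il-karozza l-ħamra waqfet quddiem id-dar.", return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
```

Fine-tuning on labelled data (rather than the untrained classification head shown here) would follow the standard transformers Trainer or a custom training loop.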
