Paper Title

Evaluation Benchmarks for Spanish Sentence Representations

Paper Authors

Araujo, Vladimir, Carvallo, Andrés, Kundu, Souvik, Cañete, José, Mendoza, Marcelo, Mercer, Robert E., Bravo-Marquez, Felipe, Moens, Marie-Francine, Soto, Alvaro

Paper Abstract

Due to the success of pre-trained language models, versions for languages other than English have been released in recent years. This fact implies the need for resources to evaluate these models. In the case of Spanish, there are few ways to systematically assess the models' quality. In this paper, we narrow the gap by building two evaluation benchmarks. Inspired by previous work (Conneau and Kiela, 2018; Chen et al., 2019), we introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations, respectively. Our benchmarks include a considerable number of pre-existing and newly constructed datasets that address different tasks from various domains. In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations. As an example, we discover that for discourse evaluation tasks, mBERT, a language model trained on multiple languages, usually provides a richer latent representation than models trained only on Spanish documents. We hope our contribution will motivate a fairer, more comparable, and less cumbersome way to evaluate future Spanish language models.
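Benchmarks in the SentEval/DiscoEval tradition typically probe a frozen encoder: sentence embeddings are extracted from the pre-trained model and a lightweight classifier is trained on top for each task. Below is a minimal, hypothetical sketch of that protocol using mBERT with mean pooling and a logistic-regression probe; the model choice, the pooling strategy, and the toy sentiment labels are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical SentEval-style probing sketch: freeze the encoder, pool token
# states into sentence vectors, and fit a simple probe on top.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()  # the encoder stays frozen; only the probe is trained

def embed(sentences):
    """Mean-pool the last hidden states into one fixed-size vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)       # (batch, tokens, 1)
    summed = (hidden * mask).sum(dim=1)                # ignore padding positions
    return (summed / mask.sum(dim=1)).numpy()          # (batch, hidden)

# Toy binary task (e.g., sentiment); a real benchmark task supplies the data.
train_x = embed(["Me encantó la película.", "El servicio fue terrible."])
train_y = [1, 0]
probe = LogisticRegression().fit(train_x, train_y)
print(probe.predict(embed(["Una experiencia maravillosa."])))
```

Because the probe is deliberately shallow, task accuracy reflects what the frozen sentence representation itself encodes, which is what makes scores comparable across models such as mBERT and Spanish-only encoders.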
