IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

Paper title

IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

Authors

Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti

Abstract

Although Indonesian is known to be the fourth most frequently used language on the internet, research progress on this language in natural language processing (NLP) is slow due to a lack of available resources. In response, we introduce the first-ever vast resource for training, evaluating, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling, with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained on Indo4B, a large and clean Indonesian dataset collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks, as well as the framework for benchmark evaluation, enabling everyone to benchmark their system performance.
