Paper Title
HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response
Authors
Abstract
Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can benefit greatly from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as the classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of experiments on Pre-trained Language Models (PLMs) to establish strong baselines for future research in this domain. The dataset is available at https://blog.thedeep.io/humset/.