Biographical: A Semi-Supervised Relation Extraction Dataset

Paper Title

Biographical: A Semi-Supervised Relation Extraction Dataset

Authors

Alistair Plum, Tharindu Ranasinghe, Spencer Jones, Constantin Orasan, Ruslan Mitkov

Abstract

Extracting biographical information from online documents is a popular research topic in the information extraction (IE) community. Various natural language processing (NLP) techniques, such as text classification, text summarisation and relation extraction (RE), are commonly used to achieve this. Among these techniques, RE is the most common, since it can be used directly to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE, since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed at digital humanities (DH) and historical research, is compiled automatically by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs and evaluating it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but, as we discuss at the end of this paper, it can be useful for other purposes as well.
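
The alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes plain case-insensitive string matching in place of NER, and the function name `align_sentences`, the relation labels `birthplace` and `occupation`, and the sample sentences are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# (sentence, subject, object, relation)
Example = Tuple[str, str, str, str]


def align_sentences(
    subject: str, facts: Dict[str, str], sentences: List[str]
) -> List[Example]:
    """Label a sentence with a relation when it mentions both the subject
    and the value of a structured fact about that subject.

    Hypothetical helper: simple substring matching stands in for the NER
    step mentioned in the abstract.
    """
    examples: List[Example] = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Skip sentences that do not mention the subject at all.
        if subject.lower() not in lowered:
            continue
        for relation, value in facts.items():
            if value.lower() in lowered:
                examples.append((sentence, subject, value, relation))
    return examples


# Hypothetical structured record, standing in for a Wikidata or Pantheon entry.
facts = {"birthplace": "Warsaw", "occupation": "physicist"}
sentences = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie was a physicist and chemist.",
    "She won two Nobel Prizes.",
]

pairs = align_sentences("Marie Curie", facts, sentences)
for sentence, subj, obj, relation in pairs:
    print(f"{relation}: ({subj}, {obj}) <- {sentence}")
```

The third sentence is skipped because it refers to the subject only by pronoun, which shows why naive matching trades recall for precision and why a more robust entity recognition step matters in practice.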
