What do You Mean by Relation Extraction? A Survey on Datasets and Study on Scientific Relation Classification

Paper title

What do You Mean by Relation Extraction? A Survey on Datasets and Study on Scientific Relation Classification

Authors

Elisa Bassignana, Barbara Plank

Abstract

Over the last five years, research on Relation Extraction (RE) witnessed extensive progress with many new dataset releases. At the same time, setup clarity has decreased, contributing to increased difficulty of reliable empirical evaluation (Taillé et al., 2020). In this paper, we provide a comprehensive survey of RE datasets, and revisit the task definition and its adoption by the community. We find that cross-dataset and cross-domain setups are particularly lacking. We present an empirical study on scientific Relation Classification across two datasets. Despite large data overlap, our analysis reveals substantial discrepancies in annotation. Annotation discrepancies strongly impact Relation Classification performance, explaining large drops in cross-dataset evaluations. Variation within further sub-domains exists but impacts Relation Classification only to limited degrees. Overall, our study calls for more rigour in reporting setups in RE and evaluation across multiple test sets.

Paper title