Paper Title
ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution
Paper Authors
Paper Abstract
Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definition of coreference and were collected using complex, lengthy guidelines written for linguistic experts. These concerns have sparked growing interest among researchers in curating a unified set of guidelines suitable for annotators with various backgrounds. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets. Surprisingly, we find that annotations of reasonable quality are achievable (>90% agreement between crowd and expert annotations) even without extensive training. On carefully analyzing the remaining disagreements, we identify linguistic cases that our annotators unanimously agree upon but that lack unified treatment in existing datasets (e.g., generic pronouns, appositives). We propose that the research community revisit these phenomena when curating future unified annotation guidelines.