Named Entity Recognition for Partially Annotated Datasets

Paper Title

Named Entity Recognition for Partially Annotated Datasets

Authors

Michael Strobl, Amine Trabelsi, Osmar Zaiane

Abstract

The most common Named Entity Recognizers are usually sequence taggers trained on fully annotated corpora, i.e., corpora in which the class of every word of every entity is known. Partially annotated corpora, in which some but not all entities of some types are annotated, are too noisy for training sequence taggers: the same entity may be annotated with its true type in one place but left unannotated in another, which misleads the tagger. Therefore, we compare three training strategies for partially annotated datasets, together with an approach for deriving new datasets for new classes of entities from Wikipedia without time-consuming manual data annotation. To verify that our data acquisition and training approaches are plausible, we manually annotated test datasets for two new classes, namely food and drugs.

Paper Title