Paper Title

CODA-19: Using a Non-Expert Crowd to Annotate Research Aspects on 10,000+ Abstracts in the COVID-19 Open Research Dataset

Authors

Huang, Ting-Hao 'Kenneth', Huang, Chieh-Yang, Ding, Chien-Kuang Cornelia, Hsu, Yen-Chia, Giles, C. Lee

Abstract

This paper introduces CODA-19, a human-annotated dataset that codes the Background, Purpose, Method, Finding/Contribution, and Other sections of 10,966 English abstracts in the COVID-19 Open Research Dataset. CODA-19 was created by 248 crowd workers from Amazon Mechanical Turk within 10 days, and achieved labeling quality comparable to that of experts. Each abstract was annotated by nine different workers, and the final labels were acquired by majority vote. The inter-annotator agreement (Cohen's kappa) between the crowd and the biomedical expert (0.741) is comparable to inter-expert agreement (0.788). CODA-19's labels have an accuracy of 82.2% when compared to the biomedical expert's labels, while the accuracy between experts was 85.0%. Reliable human annotations help scientists access and integrate the rapidly accelerating coronavirus literature, and also serve as the battery of AI/NLP research, but obtaining expert annotations can be slow. We demonstrated that a non-expert crowd can be rapidly employed at scale to join the fight against COVID-19.
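The abstract describes two mechanisms: aggregating nine workers' labels per abstract by majority vote, and measuring crowd-expert agreement with Cohen's kappa. The sketch below illustrates both; the function names and tie-breaking behavior are illustrative assumptions, not taken from the CODA-19 codebase.

```python
from collections import Counter

# The five research aspects annotated in CODA-19.
ASPECTS = ["Background", "Purpose", "Method", "Finding/Contribution", "Other"]

def majority_vote(annotations):
    """Aggregate one text segment's labels from multiple workers into a
    final label by majority vote. Ties break by first-seen order here;
    the paper's actual tie-breaking rule is not stated in the abstract."""
    return Counter(annotations).most_common(1)[0][0]

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[c] * counts_b[c] for c in set(labels_a) | set(labels_b)
    ) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, `majority_vote(["Method", "Method", "Background"])` yields `"Method"`, and `cohens_kappa` returns 1.0 for identical label sequences and lower values as disagreement grows, matching the 0.741 crowd-expert and 0.788 expert-expert figures reported above.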
