Cross-context News Corpus for Protest Events related Knowledge Base Construction

Paper title

Cross-context News Corpus for Protest Events related Knowledge Base Construction

Authors

Ali Hürriyetoğlu, Erdem Yörük, Deniz Yüret, Osman Mutlu, Çağrı Yoltar, Fırat Duruşan, Burak Gürel

Abstract

We describe a gold standard corpus of protest events that comprises various local and international English-language news sources from various countries. The corpus contains document-, sentence-, and token-level annotations. It facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, which in turn supports constructing knowledge bases that enable comparative social and political science studies. For each news source, annotation starts on random samples of news articles and continues with samples drawn using active learning. Each batch of samples was annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus has the variety and quality needed to develop and benchmark text classification and event extraction systems in a cross-context setting, which contributes to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish the common ground that automated protest event collection studies currently lack.
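
The sampling and double-annotation workflow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which active learning strategy was used, so uncertainty sampling is an assumption here, and the names `select_batch` and `needs_adjudication` are hypothetical helpers, not part of any released code.

```python
import random


def select_batch(pool, scores, batch_size, seed=0):
    """Pick the next batch of documents to annotate.

    pool: list of document ids that are not yet annotated.
    scores: dict mapping document id -> model probability that the
            document is protest-related, or None before a model exists.
    """
    if scores is None:
        # First round: no classifier yet, so draw a plain random sample.
        rng = random.Random(seed)
        return rng.sample(pool, min(batch_size, len(pool)))
    # Later rounds (assumed strategy: uncertainty sampling): prefer the
    # documents whose predicted probability is closest to 0.5.
    ranked = sorted(pool, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:batch_size]


def needs_adjudication(labels_a, labels_b):
    """Return the ids on which the two annotators disagree."""
    return sorted(d for d in labels_a if labels_a[d] != labels_b.get(d))
```

In this sketch, the disagreements returned by `needs_adjudication` are the items an annotation supervisor would resolve before the batch is added to the corpus.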
