Paper Title
Unsupervised Key Event Detection from Massive Text Corpora
Paper Authors
Paper Abstract
Automated event detection from news corpora is a crucial task towards mining fast-evolving structured knowledge. As real-world events have different granularities, from the top-level themes to key events and then to event mentions corresponding to concrete actions, there are generally two lines of research: (1) theme detection identifies from a news corpus major themes (e.g., "2019 Hong Kong Protests" vs. "2020 U.S. Presidential Election") that have very distinct semantics; and (2) action extraction extracts from one document mention-level actions (e.g., "the police hit the left arm of the protester") that are too fine-grained for comprehending the event. In this paper, we propose a new task, key event detection at the intermediate level, aiming to detect from a news corpus key events (e.g., "HK Airport Protest on Aug. 12-14"), each happening at a particular time/location and focusing on the same topic. This task can bridge event understanding and structuring and is inherently challenging because of the thematic and temporal closeness of key events and the scarcity of labeled data due to the fast-evolving nature of news articles. To address these challenges, we develop an unsupervised key event detection framework, EvMine, that (1) extracts temporally frequent peak phrases using a novel ttf-itf score, (2) merges peak phrases into event-indicative feature sets by detecting communities from our designed peak phrase graph that captures document co-occurrences, semantic similarities, and temporal closeness signals, and (3) iteratively retrieves documents related to each key event by training a classifier with automatically generated pseudo labels from the event-indicative feature sets and refining the detected key events using the retrieved documents. Extensive experiments and case studies show EvMine outperforms all the baseline methods and its ablations on two real-world news corpora.
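The abstract names a ttf-itf score for finding temporally frequent peak phrases but does not define it. As a rough illustration of the idea only, the sketch below assumes a tf-idf-style analogue over time: a phrase's document count in a small window around a day, multiplied by the log of how rare the phrase is across days. The function name `peak_scores`, the `window` parameter, and the formula itself are all hypothetical and should not be read as EvMine's actual definition.

```python
import math
from collections import Counter, defaultdict


def peak_scores(docs, window=1):
    """Score phrases by a hypothetical temporal-frequency measure.

    docs: list of (day_index, [phrases]) pairs, one per document.
    Returns {phrase: (peak_day, score)}.
    NOTE: this formula is an assumed stand-in, not the paper's ttf-itf.
    """
    # day -> phrase -> number of documents on that day containing the phrase
    daily = defaultdict(Counter)
    for day, phrases in docs:
        for phrase in set(phrases):
            daily[day][phrase] += 1

    days = sorted(daily)
    n_days = len(days)

    # Number of distinct days on which each phrase appears at all.
    active_days = Counter()
    for day in days:
        for phrase in daily[day]:
            active_days[phrase] += 1

    best = {}
    for day in days:
        for phrase in daily[day]:
            # Temporal frequency: document count within +/- `window` days.
            tf = sum(
                daily[d][phrase]
                for d in range(day - window, day + window + 1)
                if d in daily
            )
            # Inverse time frequency: phrases seen on few days score higher.
            itf = math.log(1 + n_days / active_days[phrase])
            score = tf * itf
            # Keep the earliest day on ties (strict comparison).
            if phrase not in best or score > best[phrase][1]:
                best[phrase] = (day, score)
    return best


# Toy corpus: "protest" appears on most days, "airport" is bursty around day 2.
toy_docs = [
    (0, ["protest", "police"]),
    (1, ["protest"]),
    (2, ["protest", "airport"]),
    (2, ["airport", "protest"]),
    (3, ["protest", "airport"]),
    (5, ["protest"]),
]
toy_best = peak_scores(toy_docs)
```

In this toy corpus the bursty phrase "airport" outranks the ever-present "protest", which matches the intuition in the abstract that key events within one theme are separated by phrases concentrated in time rather than by phrases common to the whole theme.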