Paper Title

Extracting Entities and Topics from News and Connecting Criminal Records

Authors

Quang Pham, Marija Stanojevic, Zoran Obradovic

Abstract

The goal of this paper is to summarize the methodologies used to extract entities and topics from a database of criminal records and a database of newspaper articles. Statistical models have been used successfully to study the topics of roughly 300,000 New York Times articles and to analyze entities related to people, organizations, and places (D Newman, 2006). Analytical approaches, especially hotspot mapping, have also been used in some studies to predict future crime locations and circumstances, and those approaches have been tested quite successfully (S Chainey, 2008). Building on these two ideas, this research applies data science techniques to analyze a large amount of data, select valuable intelligence, cluster violations by type of crime, and create a crime graph that changes over time. The task was to download criminal datasets from Kaggle and a collection of news articles from the Kaggle and EAGER project databases, and then to merge these datasets into one general dataset. The most important goal of the project was to apply statistical and natural language processing methods to extract entities and topics and to group similar data points into the correct clusters, in order to better understand public data about U.S.-related crimes.
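
The merge-then-group workflow described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the field names (`year`, `description`, `title`, `body`), the `Record` schema, and the `CRIME_KEYWORDS` lists are all hypothetical, and simple keyword matching stands in for the statistical and natural language processing methods the paper refers to. The sketch only shows how two sources might be mapped onto one general dataset and then counted by year and crime type, which is the kind of table a time-varying crime graph could be drawn from.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical keyword lists; the abstract does not say how crime types are assigned.
CRIME_KEYWORDS = {
    "theft": {"theft", "robbery", "burglary", "stolen"},
    "assault": {"assault", "battery", "attack"},
    "fraud": {"fraud", "scam", "embezzlement"},
}


@dataclass(frozen=True)
class Record:
    """Shared schema for the merged dataset (an assumption, not the paper's schema)."""
    source: str  # "crime_db" or "news"
    year: int
    text: str


def merge_datasets(crime_rows, news_rows):
    """Map both inputs onto the shared schema and concatenate them."""
    merged = [Record("crime_db", r["year"], r["description"]) for r in crime_rows]
    merged += [Record("news", a["year"], a["title"] + " " + a["body"]) for a in news_rows]
    return merged


def crime_type(text):
    """Return the first crime type whose keywords occur in the text, else 'other'."""
    tokens = {t.strip(".,;:!?\"'()").lower() for t in text.split()}
    for label, keywords in CRIME_KEYWORDS.items():
        if tokens & keywords:
            return label
    return "other"


def counts_by_year_and_type(records):
    """Count records per (year, crime type) pair."""
    return Counter((r.year, crime_type(r.text)) for r in records)
```

In a fuller pipeline, `crime_type` would be replaced by a topic model or a clustering step over text features; the keyword lookup is kept here only so the sketch stays self-contained and deterministic.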
