Paper Title

The Person Index Challenge: Extraction of Persons from Messy, Short Texts

Authors

Markus Schröder, Christian Jilek, Michael Schulze, Andreas Dengel

Abstract

When persons are mentioned in texts by their first name, last name and/or middle names, there can be high variation in which of their names are used, how the names are ordered and whether they are abbreviated. If multiple persons are mentioned consecutively in very different ways, short texts in particular can be perceived as "messy". Once ambiguous names occur, associations to persons may not be inferred correctly. Despite these eventualities, in this paper we ask how well an unsupervised algorithm can build a person index from short texts. We define a person index as a structured table that distinctly catalogs individuals by their names. First, we give a formal definition of the problem and describe a procedure to generate ground truth data for future evaluations. To give a first solution to this challenge, a baseline approach is implemented. Using our proposed evaluation strategy, we test the performance of the baseline and suggest further improvements. For future research, the source code is publicly available.
