Paper Title

Identifying Historical Travelogues in Large Text Corpora Using Machine Learning

Authors

Jan Rörden, Doris Gruber, Martin Krickl, Bernhard Haslhofer

Abstract

Travelogues represent an important and intensively studied source for scholars in the humanities, as they provide insights into people, cultures, and places of the past. However, existing studies rarely utilize more than a dozen primary sources, since the human capacities of working with a large number of historical sources are naturally limited. In this paper, we define the notion of travelogue and report upon an interdisciplinary method that, using machine learning as well as domain knowledge, can effectively identify German travelogues in the digitized inventory of the Austrian National Library with F1 scores between 0.94 and 1.00. We applied our method on a corpus of 161,522 German volumes and identified 345 travelogues that could not be identified using traditional search methods, resulting in the most extensive collection of early modern German travelogues ever created. To our knowledge, this is the first time such a method was implemented for the bibliographic indexing of a text corpus on this scale, improving and extending the traditional methods in the humanities. Overall, we consider our technique to be an important first step in a broader effort of developing a novel mixed-method approach for the large-scale serial analysis of travelogues.
