Paper Title

A Tool for Facilitating OCR Postediting in Historical Documents

Authors

Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way

Abstract

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. In the postedited text, each assumed error is replaced by a presumably correct alternative, chosen according to the scores of a language model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
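
The abstract outlines a three-step idea: flag word forms missing from a vocabulary, propose alternatives, and pick one by language-model score. The following is a minimal sketch of that idea only, not the authors' implementation. The toy corpus, the `CONFUSIONS` table, the add-one smoothed bigram model, and the names `candidates`, `bigram_logprob`, and `postedit` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the specified vocabulary and LM training text (assumption).
CORPUS = "the trade of this kingdom employs the poor and the poor serve the trade".split()
VOCAB = set(CORPUS)
UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))

# Hypothetical character confusions for early print (e.g. long s read as f).
CONFUSIONS = {"f": "s", "ſ": "s", "v": "u", "u": "v"}


def candidates(word):
    """In-vocabulary words reachable by replacing one confusable character."""
    found = set()
    for i, ch in enumerate(word):
        if ch in CONFUSIONS:
            variant = word[:i] + CONFUSIONS[ch] + word[i + 1:]
            if variant in VOCAB:
                found.add(variant)
    return found


def bigram_logprob(prev, word):
    """Add-one smoothed bigram log-probability of `word` following `prev`."""
    return math.log((BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + len(VOCAB)))


def postedit(tokens):
    """Replace out-of-vocabulary tokens by the best-scoring candidate.

    Returns the edited tokens and a log of (original, replacement) pairs,
    so that every change can be reviewed by a human.
    """
    output, log = [], []
    for token in tokens:
        if token in VOCAB:
            output.append(token)
            continue
        options = candidates(token)
        if not options:
            output.append(token)  # no suggestion: leave the token untouched
            continue
        prev = output[-1] if output else "<s>"
        best = max(sorted(options), key=lambda w: bigram_logprob(prev, w))
        output.append(best)
        log.append((token, best))
    return output, log


edited, changes = postedit("the poor ferve the trade".split())
print(" ".join(edited))
print(changes)
```

Returning the change log alongside the edited text mirrors the abstract's emphasis on transparency: a reviewer can inspect or revert each substitution. A practical system would presumably need a far larger vocabulary and a stronger language model than this toy bigram counter.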
