Paper title

A Tool for Facilitating OCR Postediting in Historical Documents

Authors

Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way

Abstract

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
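
The general idea in the abstract (flag word forms missing from a vocabulary, then pick a replacement by language-model score) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy reference text, the single-character-edit candidate generator, the add-one smoothed bigram scorer, and all function names (`candidates`, `bigram_logprob`, `postedit`) are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Toy reference text standing in for a real vocabulary and language model (assumption).
REFERENCE = "the trade of this kingdom employs the poor and the trade must be regulated".split()
VOCAB = set(REFERENCE)
UNIGRAMS = Counter(REFERENCE)
BIGRAMS = Counter(zip(REFERENCE, REFERENCE[1:]))


def candidates(word, vocab):
    """Return vocabulary words one character edit away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    edits = set()
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            edits.add(left + right[1:])                          # deletion
            edits.update(left + c + right[1:] for c in letters)  # substitution
        edits.update(left + c + right for c in letters)          # insertion
    return sorted(edits & vocab)


def bigram_logprob(prev, word):
    """Add-one smoothed bigram log-probability from the toy counts."""
    return math.log((BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + len(VOCAB)))


def postedit(tokens):
    """Replace out-of-vocabulary tokens with the best-scoring in-vocabulary candidate."""
    output = []
    for token in tokens:
        if token in VOCAB:
            output.append(token)
            continue
        options = candidates(token, VOCAB)
        if not options:
            output.append(token)  # no suggestion: leave the token for human review
            continue
        prev = output[-1] if output else "<s>"
        output.append(max(options, key=lambda c: bigram_logprob(prev, c)))
    return output


print(postedit("the trabe of this kingdcm".split()))
```

A real setup would swap the toy counts for a vocabulary and language model trained on period-appropriate text, and would likely log each substitution so that a human editor can accept or reject it.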
