Information Extraction from Scanned Invoice Images using Text Analysis and Layout Features

Paper Title

Information Extraction from Scanned Invoice Images using Text Analysis and Layout Features

Authors

Hien Thi Ha, Aleš Horák

Abstract

While storing invoice content as metadata to avoid paper document processing may be the future trend, almost all invoices issued daily are still printed on paper or generated in digital formats such as PDF. In this paper, we introduce the OCRMiner system for information extraction from scanned document images. The system combines text analysis techniques with layout features to extract indexing metadata from (semi-)structured documents. It is designed to process a document in a way similar to a human reader, i.e. to employ different layout and text attributes in a coordinated decision. The system consists of a set of interconnected modules that start from the (possibly erroneous) character-based output of a standard OCR system and allow different techniques to be applied and the extracted knowledge to be expanded at each step. Using an open-source OCR, the system is able to recover 90% of the invoice data for the English set and 88% for the Czech set.

Paper Title