docExtractor: An off-the-shelf historical document element extraction

Paper title

docExtractor: An off-the-shelf historical document element extraction

Authors

Tom Monnier, Mathieu Aubry

Abstract

We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents without requiring any real data annotation. We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets and leads to results on par with the state of the art when fine-tuned. We argue that the performance obtained without fine-tuning on a specific dataset is critical for applications, in particular in digital humanities, and that the line-level page segmentation we address is the most relevant for a general-purpose element extraction engine. We rely on a fast generator of rich synthetic documents and design a fully convolutional network, which we show to generalize better than a detection-based approach. Furthermore, we introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of illustration segmentation in historical documents.

Paper title