Paper title
docExtractor: An off-the-shelf historical document element extraction
Authors
Abstract
We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents without requiring any real data annotation. We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets and leads to results on par with the state of the art when fine-tuned. We argue that the performance obtained without fine-tuning on a specific dataset is critical for applications, in particular in digital humanities, and that the line-level page segmentation we address is the most relevant for a general-purpose element extraction engine. We rely on a fast generator of rich synthetic documents and design a fully convolutional network, which we show generalizes better than a detection-based approach. Furthermore, we introduce a new public dataset, dubbed IlluHisDoc, dedicated to the fine evaluation of illustration segmentation in historical documents.