Paper Title
HRCenterNet: An Anchorless Approach to Chinese Character Segmentation in Historical Documents
Authors
Abstract
Historical documents have always been indispensable to the transmission of human civilization, yet they are susceptible to damage from various factors. Thanks to recent technology, automatic digitization is one of the quickest and most effective means of preserving these documents. Automatic text digitization can be divided into two main stages: character segmentation and character recognition, where the recognition results depend largely on the accuracy of segmentation. In this study, we therefore focus only on character segmentation in historical Chinese documents. We propose a model named HRCenterNet, which combines an anchorless object detection method with a parallelized architecture. The MTHv2 dataset consists of over 3000 Chinese historical document images and over 1 million individual Chinese characters; using this large dataset, our model achieves an average IoU of 0.81 with the best speed-accuracy trade-off among the compared methods. Our source code is available at https://github.com/Tverous/HRCenterNet.
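The abstract reports segmentation quality as an average IoU (intersection over union). As a minimal sketch of what that metric measures, the Python below computes IoU for axis-aligned boxes and averages it over matched pairs. The function names `box_iou` and `mean_iou`, the `(x_min, y_min, x_max, y_max)` box format, and the assumption that each predicted box is already paired with a ground-truth box are illustrative choices, not details taken from the paper or its repository.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    The box format is an assumption made for this sketch.
    """
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes give zero intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IoU over (predicted_box, ground_truth_box) pairs.

    Assumes the pairing has already been done; how predictions are
    matched to ground truth is not specified in the abstract.
    """
    if not pairs:
        return 0.0
    return sum(box_iou(pred, gt) for pred, gt in pairs) / len(pairs)
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7, while identical boxes score 1.0 and disjoint boxes score 0.0.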