DiT: Self-supervised Pre-training for Document Image Transformer

Paper Title

DiT: Self-supervised Pre-training for Document Image Transformer

Authors

Li, Junlong; Xu, Yiheng; Lv, Tengchao; Cui, Lei; Zhang, Cha; Wei, Furu

Abstract

Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55), and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.
