Paper title
DiT: Self-supervised Pre-training for Document Image Transformer
Authors
Abstract
Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose \textbf{DiT}, a self-supervised pre-trained \textbf{D}ocument \textbf{I}mage \textbf{T}ransformer model that uses large-scale unlabeled text images for Document AI tasks. Self-supervision is essential here because no supervised counterparts exist, owing to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 $\rightarrow$ 92.69), document layout analysis (91.0 $\rightarrow$ 94.9), table detection (94.23 $\rightarrow$ 96.55), and text detection for OCR (93.07 $\rightarrow$ 94.29). The code and pre-trained models are publicly available at \url{https://aka.ms/msdit}.