Paper Title


Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition

Paper Authors

Mingkun Yang, Minghui Liao, Pu Lu, Jing Wang, Shenggao Zhu, Hualin Luo, Qi Tian, Xiang Bai

Abstract


Existing text recognition methods usually need large-scale training data. Most of them rely on synthetic training data due to the lack of annotated real images. However, there is a domain gap between synthetic data and real data, which limits the performance of text recognition models. Recent self-supervised text recognition methods attempt to utilize unlabeled real images by introducing contrastive learning, which mainly learns the discrimination of text images. Inspired by the observation that humans learn to recognize text through both reading and writing, we propose to learn both discrimination and generation by integrating contrastive learning and masked image modeling in our self-supervised method. The contrastive learning branch is adopted to learn the discrimination of text images, which imitates the reading behavior of humans. Meanwhile, masked image modeling is introduced into text recognition for the first time to learn the contextual generation of text images, which is analogous to the writing behavior. The experimental results show that our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with a similar model size. We also demonstrate that our pre-trained model can be easily applied to other text-related tasks with clear performance gains. The code is available at https://github.com/ayumiymk/DiG.
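To make the training objective described in the abstract concrete, below is a minimal PyTorch-style sketch of how a contrastive ("reading") loss and a masked-image-modeling ("writing") loss can be combined on top of a shared encoder. This is not the official DiG implementation: the module names, projection and decoder heads, temperature, and equal loss weighting are illustrative assumptions.

```python
# Minimal sketch (not the official DiG code) of joint discriminative + generative
# self-supervised pre-training: a contrastive loss over two augmented views plus a
# masked-image-modeling reconstruction loss, sharing one encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiscriminativeGenerativePretrainer(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int = 512, patch_dim: int = 768):
        super().__init__()
        self.encoder = encoder                          # shared backbone over text-image patches (assumed ViT-like)
        self.proj = nn.Linear(feat_dim, 128)            # projection head for the contrastive branch (assumption)
        self.decoder = nn.Linear(feat_dim, patch_dim)   # lightweight head predicting masked patch pixels (assumption)

    def contrastive_loss(self, z1, z2, tau: float = 0.2):
        # InfoNCE between two augmented views of the same batch of text images.
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / tau
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)

    def forward(self, view1, view2, masked_view, target_patches, mask):
        # "Reading": learn to discriminate text images via contrastive learning.
        z1 = self.proj(self.encoder(view1).mean(dim=1))
        z2 = self.proj(self.encoder(view2).mean(dim=1))
        loss_con = self.contrastive_loss(z1, z2)

        # "Writing": reconstruct masked patches from visible context (masked image modeling).
        pred = self.decoder(self.encoder(masked_view))              # (B, N, patch_dim)
        loss_mim = ((pred - target_patches) ** 2 * mask.unsqueeze(-1)).sum() / mask.sum().clamp(min=1)

        # Equal weighting of the two losses is an assumption, not a reported setting.
        return loss_con + loss_mim
```

Summing the two losses lets the shared encoder pick up discriminative features from the contrastive branch and contextual generation ability from the reconstruction branch at the same time, which is the core idea the abstract describes.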
