IterVM: Iterative Vision Modeling Module for Scene Text Recognition

Paper Title

IterVM: Iterative Vision Modeling Module for Scene Text Recognition

Authors

Xiaojie Chu, Yongtao Wang

Abstract

Scene text recognition (STR) is a challenging problem because of the imperfect imaging conditions in natural images. State-of-the-art methods use both visual cues and linguistic knowledge to tackle it. Specifically, they propose an iterative language modeling module (IterLM) that repeatedly refines the output sequence from the vision modeling module (VM). Although these methods achieve promising results, the vision modeling module has become their performance bottleneck. In this paper, we propose an iterative vision modeling module (IterVM) to further improve STR accuracy. Specifically, the first VM extracts multi-level features directly from the input image, and each following VM re-extracts multi-level features from the input image and fuses them with the high-level (i.e., the most semantic) feature extracted by the previous VM. By combining the proposed IterVM with an iterative language modeling module, we further propose a powerful scene text recognizer called IterNet. Extensive experiments demonstrate that the proposed IterVM can significantly improve scene text recognition accuracy, especially on low-quality scene text images. Moreover, the proposed scene text recognizer IterNet achieves new state-of-the-art results on several public benchmarks. Code will be available at https://github.com/VDIGPKU/IterNet.
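
The iteration described in the abstract can be sketched as a simple control flow. The following is a toy illustration only, not the authors' implementation: the names `toy_vm` and `iter_vm`, the "halve each value" layer, and fusion by elementwise addition are all assumptions made for this sketch, since the abstract does not specify how features are extracted or fused. Real multi-level features would also differ in resolution, which this sketch ignores.

```python
from typing import List, Optional

Feature = List[float]


def toy_vm(image: Feature,
           prev_high: Optional[Feature] = None,
           levels: int = 3) -> List[Feature]:
    """Hypothetical stand-in for one vision modeling module (VM).

    Returns `levels` feature vectors ordered from low-level to high-level.
    If `prev_high` is given, it is fused into every level by elementwise
    addition (an assumption of this sketch, not taken from the paper).
    """
    features: List[Feature] = []
    current = image
    for _ in range(levels):
        # Placeholder "layer": halve each value instead of a real network stage.
        current = [0.5 * x for x in current]
        if prev_high is not None:
            # Fuse with the high-level feature from the previous VM.
            current = [c + p for c, p in zip(current, prev_high)]
        features.append(current)
    return features


def iter_vm(image: Feature, num_iters: int = 3, levels: int = 3) -> Feature:
    """Run `num_iters` VMs in sequence on the same input image.

    The first VM sees only the image; each following VM re-extracts
    features from the image and receives the previous VM's high-level feature.
    """
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    prev_high: Optional[Feature] = None
    for _ in range(num_iters):
        features = toy_vm(image, prev_high, levels)
        prev_high = features[-1]  # the most semantic (last) level
    return prev_high


if __name__ == "__main__":
    print(iter_vm([1.0, 2.0], num_iters=2))
```

The point of the sketch is the data flow: every VM reads the original image again, and only the last-level feature is passed forward between iterations.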
