Paper Title

ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting

Paper Authors

Shancheng Fang, Zhendong Mao, Hongtao Xie, Yuxin Wang, Chenggang Yan, Yongdong Zhang

Paper Abstract

Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than relying on pure visual classification. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input. Correspondingly, we propose ABINet++, an autonomous, bidirectional and iterative network for scene text spotting. Firstly, the autonomous principle enforces explicit language modeling by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two models. Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose an iterative-correction execution manner for the language model, which effectively alleviates the impact of noisy input. Finally, to polish ABINet++ for long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and we design a position and content attention module that integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, consistently demonstrating the superiority of our method in various environments, especially on low-quality images. Besides, extensive experiments on both English and Chinese also prove that a text spotter incorporating our language modeling method can significantly improve performance in both accuracy and speed compared with commonly used attention-based recognizers.
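The iterative-correction execution manner mentioned in the abstract can be caricatured with a toy sketch: a vision model emits a noisy prediction, a language model refines it, and the refined result is fed back to the language model for a fixed number of rounds. This is only a conceptual stand-in, not the paper's Transformer-based BCN; the lexicon, the `language_correct` helper, and the use of difflib string matching are illustrative assumptions.

```python
from difflib import get_close_matches

# Hypothetical mini-lexicon standing in for the knowledge a trained
# language model would encode.
LEXICON = ["shower", "towel", "street", "spotting"]

def language_correct(pred: str) -> str:
    # Toy "language model": snap the prediction to the closest lexicon
    # word by string similarity (a real LM would rescore characters).
    return get_close_matches(pred, LEXICON, n=1, cutoff=0.0)[0]

def iterative_correction(visual_pred: str, iterations: int = 3) -> str:
    # Feed the corrected prediction back into the language model each
    # round, mimicking the iterative execution manner: noisy input is
    # progressively cleaned instead of being corrected in one shot.
    pred = visual_pred
    for _ in range(iterations):
        pred = language_correct(pred)
    return pred

print(iterative_correction("sh0wer"))  # the noisy visual read "sh0wer" converges to "shower"
```

Once the prediction matches a lexicon word, further rounds leave it unchanged, which is the fixed point the iterative scheme converges to.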
