Paper Title

Vision-Language Adaptive Mutual Decoder for OOV-STR

Paper Authors

Jinshui Hu, Chenyu Liu, Qiandong Yan, Xuyang Zhu, Jiajia Wu, Jun Du, Lirong Dai

Paper Abstract

Recent works have shown huge success of deep learning models for common in-vocabulary (IV) scene text recognition. However, in real-world scenarios, out-of-vocabulary (OOV) words are of great importance, and SOTA recognition models usually perform poorly in OOV settings. Inspired by the intuition that the learned language prior limits OOV performance, we design a framework named Vision-Language Adaptive Mutual Decoder (VLAMD) to partly tackle the OOV problem. VLAMD consists of three main components. First, we build an attention-based LSTM decoder with two adaptively merged visual-only modules, yielding a vision-language balanced main branch. Second, we add an auxiliary query-based autoregressive transformer decoding head for common visual and language prior representation learning. Finally, we couple these two designs with bidirectional training for more diverse language modeling, and perform mutual sequential decoding to obtain more robust results. Our approach achieved 70.31\% and 59.61\% word accuracy on the IV+OOV and OOV settings, respectively, of the Cropped Word Recognition Task of the OOV-ST Challenge at the ECCV 2022 TiE Workshop, where we took 1st place in both settings.
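For concreteness, below is a minimal PyTorch sketch of how the two decoding branches described in the abstract could be wired together: an attention-based LSTM main branch whose language-aware state is adaptively merged with a vision-only context, an auxiliary query-based autoregressive transformer head, and a simple mutual-decoding step. All module names (AdaptiveVisualGate, MainLSTMBranch, AuxTransformerHead, mutual_step), layer sizes, and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveVisualGate(nn.Module):
    """Adaptively merge a vision-only context into the language-aware state."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, lang_state, vis_ctx):
        g = self.gate(torch.cat([lang_state, vis_ctx], dim=-1))
        return g * vis_ctx + (1.0 - g) * lang_state  # per-dimension balance


class MainLSTMBranch(nn.Module):
    """Attention-based LSTM decoder with adaptive visual merging (main branch)."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.cell = nn.LSTMCell(2 * dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.merge = AdaptiveVisualGate(dim)
        self.cls = nn.Linear(dim, vocab_size)

    def step(self, visual_feats, prev_token, state):
        h, c = state
        # attend over flattened visual features with the hidden state as query
        ctx, _ = self.attn(h.unsqueeze(1), visual_feats, visual_feats)
        ctx = ctx.squeeze(1)
        h, c = self.cell(torch.cat([self.embed(prev_token), ctx], dim=-1), (h, c))
        fused = self.merge(h, ctx)            # vision-language balanced feature
        return self.cls(fused), (h, c)


class AuxTransformerHead(nn.Module):
    """Auxiliary query-based autoregressive transformer decoding head."""
    def __init__(self, dim, vocab_size, max_len=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(max_len, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.cls = nn.Linear(dim, vocab_size)

    def step(self, visual_feats, prev_tokens):
        t = prev_tokens.size(1)
        tgt = self.embed(prev_tokens) + self.pos[:t]
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=tgt.device), diagonal=1)
        out = self.decoder(tgt, visual_feats, tgt_mask=causal)
        return self.cls(out[:, -1])           # logits for the next character


def mutual_step(main_logits, aux_logits):
    """One possible mutual-decoding rule (an assumption): average the two
    branches' log-probabilities and take the jointly most likely character."""
    log_p = 0.5 * (F.log_softmax(main_logits, -1) + F.log_softmax(aux_logits, -1))
    return log_p.argmax(dim=-1)
```

The bidirectional (left-to-right and right-to-left) training mentioned in the abstract is omitted from this sketch for brevity.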
