Title

Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Authors

Wei Han, Hantao Huang, Tao Han

Abstract

Image text carries essential information to understand the scene and perform reasoning. Text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of text is underused and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge. Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.
