Paper Title

Text-Aware Dual Routing Network for Visual Question Answering

Authors

Luoqian Jiang, Yifan He, Jian Chen

Abstract

Visual question answering (VQA) is a challenging task to provide an accurate natural language answer given an image and a natural language question about the image. It involves multi-modal learning, i.e., computer vision (CV) and natural language processing (NLP), as well as flexible answer prediction for free-form and open-ended answers. Existing approaches often fail in cases that require reading and understanding text in images to answer questions. In practice, they cannot effectively handle the answer sequence derived from text tokens because the visual features are not text-oriented. To address the above issues, we propose a Text-Aware Dual Routing Network (TDR) which simultaneously handles the VQA cases with and without understanding text information in the input images. Specifically, we build a two-branch answer prediction network that contains a specific branch for each case and further develop a dual routing scheme to dynamically determine which branch should be chosen. In the branch that involves text understanding, we incorporate the Optical Character Recognition (OCR) features into the model to help understand the text in the images. Extensive experiments on the VQA v2.0 dataset demonstrate that our proposed TDR outperforms existing methods, especially on the "number" related VQA questions.
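The dual routing scheme described above can be illustrated with a minimal sketch: a gate produces a probability for each of the two answer branches (fixed-vocabulary prediction vs. OCR-token prediction), and the answer comes from whichever branch the gate favors. This is a hypothetical simplification for intuition only; the gate logits, branch outputs, and function names below are assumptions, not the paper's actual implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dual_route(gate_logits, vocab_answer, ocr_answer):
    """Pick the answer from the branch with the higher routing probability.

    gate_logits: [logit_vocab_branch, logit_ocr_branch], produced upstream
    by a routing network (hypothetical here).
    """
    p_vocab, p_ocr = softmax(gate_logits)
    return vocab_answer if p_vocab >= p_ocr else ocr_answer

# Example: a gate that leans toward the vocabulary branch routes to it.
print(dual_route([2.0, 0.5], "yes", "stop"))   # prints "yes"
# A gate that leans toward the OCR branch routes the text-reading answer.
print(dual_route([0.1, 3.0], "yes", "stop"))   # prints "stop"
```

In the actual model the routing decision would be learned end to end; a hard argmax choice like the one above is just the simplest way to show how one branch per case can be selected dynamically.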
