Paper Title

Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering

Paper Authors

Li, Mingxiao, Moens, Marie-Francine

Paper Abstract

Knowledge-based visual question answering (VQA) is a vision-language task that requires an agent to correctly answer image-related questions using knowledge that is not presented in the given image. It is not only a more challenging task than regular VQA but also a vital step towards building a general VQA system. Most existing knowledge-based VQA systems process knowledge and image information similarly and ignore the fact that the knowledge base (KB) contains complete information about a triplet, while the extracted image information might be incomplete as the relations between two objects are missing or wrongly detected. In this paper, we propose a novel model named dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR), which performs explicit and implicit reasoning over a key-value knowledge memory module and a spatial-aware image graph, respectively. Specifically, the memory module learns a dynamic knowledge representation and generates a knowledge-aware question representation at each reasoning step. Then, this representation is used to guide a graph attention operator over the spatial-aware image graph. Our model achieves new state-of-the-art accuracy on the KRVQR and FVQA datasets. We also conduct ablation experiments to prove the effectiveness of each component of the proposed model.
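The explicit-reasoning part of the abstract — addressing a key-value knowledge memory with the question and folding the readout back in at each reasoning step — can be sketched in a few lines. This is a minimal illustration of standard key-value attention with an additive multi-step update; the dimensions, the update rule, and all function names are assumptions for illustration, not DMMGR's exact architecture (which additionally guides a graph attention operator over a spatial-aware image graph).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kv_memory_read(q, keys, values):
    """One explicit-reasoning step over a key-value knowledge memory:
    address the keys with the question vector, then read out an
    attention-weighted sum of the values."""
    w = softmax(keys @ q)   # (N,) attention over memory slots
    return w @ values       # (d,) knowledge readout

def multi_step_reason(q, keys, values, steps=3):
    """Multi-step reasoning: at each step the memory readout is folded
    back into the question to form a knowledge-aware representation.
    The simple additive update is an illustrative assumption."""
    for _ in range(steps):
        q = q + kv_memory_read(q, keys, values)
    return q

rng = np.random.default_rng(0)
d, n = 8, 5
q0 = rng.normal(size=d)        # question embedding
K = rng.normal(size=(n, d))    # key embeddings of KB triplets
V = rng.normal(size=(n, d))    # value embeddings of KB triplets
q_aware = multi_step_reason(q0, K, V)
print(q_aware.shape)           # (8,)
```

In the full model, the knowledge-aware question vector produced at each step would then drive attention over the image graph, letting the complete KB triplets compensate for missing or misdetected object relations in the extracted image information.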
