Paper Title


Question-Driven Graph Fusion Network for Visual Question Answering

Authors

Yuxi Qian, Yuncong Hu, Ruonan Wang, Fangxiang Feng, Xiaojie Wang

Abstract


Existing Visual Question Answering (VQA) models have explored various visual relationships between objects in an image to answer complex questions, which inevitably introduces irrelevant information from inaccurate object detection and text grounding. To address this problem, we propose a Question-Driven Graph Fusion Network (QD-GFN). It first models semantic, spatial, and implicit visual relations in the image with three graph attention networks; question information is then utilized to guide the aggregation of the three graphs. In addition, QD-GFN adopts an object filtering mechanism to remove question-irrelevant objects from the image. Experimental results demonstrate that QD-GFN outperforms the prior state of the art on both the VQA 2.0 and VQA-CP v2 datasets. Further analysis shows that both the novel graph aggregation method and the object filtering mechanism play a significant role in improving the model's performance.
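The two ideas the abstract highlights — question-guided aggregation of per-graph object features and question-driven object filtering — can be illustrated with a minimal NumPy sketch. This is a conceptual toy, not the authors' QD-GFN: the function names, the dot-product relevance scores, and the top-k filter are all illustrative assumptions, standing in for the learned graph attention networks and gating used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_fusion(graph_feats, q, filter_k=None):
    """Fuse object features from several relation graphs, guided by a question.

    graph_feats: list of (num_objects, dim) arrays, one per relation graph
                 (e.g. semantic / spatial / implicit views of the same objects)
    q:           (dim,) question embedding
    filter_k:    if set, keep only the k most question-relevant objects
                 (toy stand-in for the paper's object filtering mechanism)
    """
    stacked = np.stack(graph_feats)                      # (G, N, d)
    # How relevant each graph's view of each object is to the question.
    scores = stacked @ q                                 # (G, N)
    # Question-driven weights over the G graphs, per object.
    weights = softmax(scores, axis=0)                    # (G, N)
    fused = (weights[..., None] * stacked).sum(axis=0)   # (N, d)
    if filter_k is not None:
        # Drop objects whose fused features score low against the question.
        relevance = fused @ q                            # (N,)
        keep = np.sort(np.argsort(relevance)[::-1][:filter_k])
        fused = fused[keep]
    return fused

# Usage: 3 relation graphs over 5 objects with 8-dim features.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((5, 8)) for _ in range(3)]
q = rng.standard_normal(8)
out = question_guided_fusion(feats, q, filter_k=3)
print(out.shape)  # (3, 8): 3 question-relevant objects survive the filter
```

In the actual model the per-graph weights and relevance scores would be produced by learned attention layers rather than raw dot products, but the flow — score each graph's view against the question, fuse, then prune irrelevant objects — is the same.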
