Paper Title
Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions
Paper Authors
Paper Abstract
Visual Question Answering is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case of open-ended VQA), or via classification over a set of multiple-choice-type answers. In this work, we present a completely generative formulation where a multi-word answer is generated for a visual query. To take this a step forward, we introduce a new task: ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer and a rationale that seeks to justify the generated answer. We propose an end-to-end architecture to solve this task and describe how to evaluate it. We show that our model generates strong answers and rationales through qualitative and quantitative evaluation, as well as through a human Turing Test.