Paper Title
Towards Reasoning-Aware Explainable VQA
Paper Authors
Paper Abstract
The domain of joint vision-language understanding, especially in the context of reasoning in Visual Question Answering (VQA) models, has garnered significant attention in recent years. While most existing VQA models focus on improving VQA accuracy, the way a model arrives at an answer is often a black box. As a step towards making the VQA task more explainable and interpretable, our method builds upon a SOTA VQA framework, augmenting it with an end-to-end explanation generation module. In this paper, we investigate two network architectures, a Long Short-Term Memory (LSTM) network and a Transformer decoder, as the explanation generator. Our method generates human-readable textual explanations while maintaining SOTA VQA accuracy on the GQA-REX (77.49%) and VQA-E (71.48%) datasets. Approximately 65.16% of the generated explanations are judged valid by human evaluators, and roughly 60.5% are both valid and lead to the correct answer.
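The abstract describes augmenting a VQA backbone with an end-to-end explanation generation module whose decoder is either an LSTM or a Transformer. Below is a minimal PyTorch-style sketch of that idea, not the paper's actual implementation: the module names, fusion scheme, feature dimensions, and vocabulary sizes are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExplainableVQA(nn.Module):
    """Sketch: fused vision-language features feed both an answer
    classifier and an autoregressive explanation decoder. Sizes and
    components are illustrative assumptions, not the paper's design."""

    def __init__(self, vocab_size=10000, num_answers=1842, d_model=512):
        super().__init__()
        # Stand-ins for the pretrained encoders of a SOTA VQA backbone.
        self.question_embed = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(2048, d_model)  # e.g. region features
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Answer head: standard classification over a fixed answer set.
        self.answer_head = nn.Linear(d_model, num_answers)
        # Explanation head: Transformer decoder over the explanation
        # vocabulary (the paper also studies an LSTM variant).
        self.expl_embed = nn.Embedding(vocab_size, d_model)
        self.expl_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.expl_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, question_ids, expl_ids):
        # Fuse visual region features with question tokens.
        v = self.visual_proj(visual_feats)             # (B, R, d_model)
        q = self.question_embed(question_ids)          # (B, Tq, d_model)
        fused = self.fusion(torch.cat([v, q], dim=1))  # (B, R+Tq, d_model)
        # Answer prediction from a pooled fused representation.
        answer_logits = self.answer_head(fused.mean(dim=1))
        # Teacher-forced explanation decoding over the fused features.
        tgt = self.expl_embed(expl_ids)                # (B, Te, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.expl_decoder(tgt, fused, tgt_mask=causal)
        expl_logits = self.expl_head(dec)              # (B, Te, vocab)
        return answer_logits, expl_logits


# Joint end-to-end forward pass with dummy inputs.
model = ExplainableVQA()
vis = torch.randn(2, 36, 2048)            # 36 region features per image
q = torch.randint(0, 10000, (2, 12))      # tokenized questions
e = torch.randint(0, 10000, (2, 20))      # tokenized explanations
ans_logits, expl_logits = model(vis, q, e)
print(ans_logits.shape, expl_logits.shape)
```

In this sketch both heads would be trained jointly, with a cross-entropy loss over the answer set plus a token-level language-modeling loss over the explanation, mirroring the end-to-end setup the abstract describes; swapping `expl_decoder` for an `nn.LSTM` over the same fused features would give the LSTM variant.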