Paper Title
Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer Grounding
Paper Authors
Paper Abstract
Answer grounding aims to reveal the visual evidence for visual question answering (VQA), which entails highlighting relevant positions in the image when answering questions about it. Previous attempts typically tackle this problem with pretrained object detectors, which lack flexibility for objects outside the predefined vocabulary. Moreover, these black-box methods concentrate solely on linguistic generation and ignore visual interpretability. In this paper, we propose Dual Visual-Linguistic Interaction (DaVI), a novel unified end-to-end framework capable of both linguistic answering and visual grounding. DaVI introduces two visual-linguistic interaction mechanisms: 1) a visual-based linguistic encoder that understands questions incorporated with visual features and produces linguistic-oriented evidence for further answer decoding, and 2) a linguistic-based visual decoder that focuses visual features on evidence-related regions for answer grounding. With this approach, we ranked first in the answer grounding track of the 2022 VizWiz Grand Challenge.
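The two interaction mechanisms in the abstract can be pictured as a pair of cross-attention passes: question tokens attend to visual features to produce linguistic-oriented evidence, and visual features in turn attend to that evidence to concentrate on evidence-related regions. The sketch below is a minimal, hypothetical illustration of this dual-interaction idea with scaled dot-product attention; all shapes, names, and the single-head formulation are assumptions for illustration, not the paper's actual DaVI implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention: each query row attends over keys_values.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

# Illustrative shapes: 5 question tokens, 49 visual patches, 64-dim features.
rng = np.random.default_rng(0)
question_tokens = rng.standard_normal((5, 64))
visual_features = rng.standard_normal((49, 64))

# 1) Visual-based linguistic encoder (sketch): question tokens fused with
#    visual features yield linguistic-oriented evidence for answer decoding.
linguistic_evidence = cross_attention(question_tokens, visual_features)

# 2) Linguistic-based visual decoder (sketch): visual features attend to the
#    evidence, emphasizing evidence-related regions for answer grounding.
grounded_visual = cross_attention(visual_features, linguistic_evidence)

print(linguistic_evidence.shape)  # (5, 64)
print(grounded_visual.shape)      # (49, 64)
```

In a full model each pass would use learned query/key/value projections and multiple heads; the sketch keeps only the attention flow to show how evidence moves language-to-vision and back.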