Paper Title

Learning content and context with language bias for Visual Question Answering

Authors

Chao Yang, Su Feng, Dongsheng Li, Huawei Shen, Guoqing Wang, Bin Jiang

Abstract

Visual Question Answering (VQA) is a challenging multimodal task that requires answering questions about an image. Many works concentrate on reducing language bias, which makes models answer questions while ignoring visual content and language context. However, reducing language bias also weakens the ability of VQA models to learn context priors. To address this issue, we propose a novel learning strategy named CCB, which forces VQA models to answer questions relying on Content and Context with language Bias. Specifically, CCB establishes Content and Context branches on top of a base VQA model and forces them to focus on local key content and global effective context, respectively. Moreover, a joint loss function is proposed to reduce the importance of biased samples while retaining their beneficial influence on answering questions. Experiments show that CCB outperforms state-of-the-art methods in terms of accuracy on VQA-CP v2.
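The abstract does not give the exact form of the joint loss, but its stated goal, down-weighting biased samples without discarding them, can be illustrated with a minimal sketch. Here we assume a hypothetical per-sample weight derived from the confidence `p_bias` of a question-only bias model; the function name, the weighting scheme `(1 - p_bias) ** alpha`, and the parameter `alpha` are illustrative assumptions, not the paper's actual formulation.

```python
import math

def joint_loss(p_answer, p_bias, alpha=1.0):
    """Hypothetical sketch of a bias-aware loss.

    p_answer: probability the VQA model assigns to the ground-truth answer.
    p_bias:   probability a question-only (bias) model assigns to it;
              high p_bias means the sample can be answered from language
              prior alone, i.e. it is "biased".
    The sample's cross-entropy term is scaled down as p_bias grows, so
    biased samples still contribute, just with reduced importance.
    """
    weight = (1.0 - p_bias) ** alpha          # biased samples get smaller weight
    return -weight * math.log(p_answer)       # weighted cross-entropy term

# A sample the bias model answers confidently contributes less to the loss
# than one it cannot answer from the question alone:
hard = joint_loss(p_answer=0.6, p_bias=0.1)   # bias model is unsure
easy = joint_loss(p_answer=0.6, p_bias=0.9)   # bias model is confident
```

In this sketch `hard > easy`, matching the abstract's intent: biased samples are de-emphasized rather than removed, so their beneficial context prior is retained.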
