Paper Title
Counterfactual Samples Synthesizing for Robust Visual Question Answering
Paper Authors
Abstract
Although Visual Question Answering (VQA) has made impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the training set and fail to generalize to test sets with different QA distributions. To reduce these language biases, several recent works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, achieving dominant performance on VQA-CP. However, due to the complexity of their design, current methods are unable to equip ensemble-based models with two indispensable characteristics of an ideal VQA model: 1) visual-explainable: the model should rely on the right visual regions when making decisions; 2) question-sensitive: the model should be sensitive to linguistic variations in the question. To this end, we propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions and assigning different ground-truth answers. After training with the complementary samples (i.e., the original and generated samples), VQA models are forced to focus on all critical objects and words, which significantly improves both visual-explainable and question-sensitive abilities. In return, the performance of these models is further boosted. Extensive ablations demonstrate the effectiveness of CSS. In particular, by building on top of the model LMH, we achieve a record-breaking performance of 58.95% on VQA-CP v2, a 6.5% gain.
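The core CSS idea of masking critical words in a question to form a counterfactual sample can be illustrated with a minimal sketch. This is not the paper's actual implementation (which selects critical objects and words via modulated attention and assigns new ground-truth answers accordingly); the function name, mask token, and the hand-picked critical indices below are all illustrative assumptions.

```python
# Hypothetical sketch of the question-side half of CSS:
# replace the critical words of a question with a mask token,
# yielding a counterfactual question whose ground-truth answer
# should differ from the original sample's answer.

MASK_TOKEN = "[MASK]"  # illustrative choice of mask symbol

def synthesize_counterfactual_question(question_tokens, critical_indices):
    """Return a copy of the question with its critical words masked out.

    question_tokens: list of word tokens of the original question.
    critical_indices: set of positions deemed critical for answering
        (in the real method these are found automatically, not by hand).
    """
    return [MASK_TOKEN if i in critical_indices else tok
            for i, tok in enumerate(question_tokens)]

# Example: masking the key words of "what color is the bus".
tokens = ["what", "color", "is", "the", "bus"]
counterfactual = synthesize_counterfactual_question(tokens, {1, 4})
print(counterfactual)  # ['what', '[MASK]', 'is', 'the', '[MASK]']
```

Training on both the original and such masked samples is what pushes the model to attend to every critical word rather than to dataset shortcuts; the image-side variant masks critical object regions instead of words.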