Paper Title
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Paper Authors
Paper Abstract
Visual question answering is an important task in both natural language and vision understanding. However, in most public visual question answering datasets such as VQA and CLEVR, the questions are human-generated and specific to a given image, such as `What color are her eyes?'. These human-generated, crowdsourced questions are relatively simple and sometimes biased toward certain entities or attributes. In this paper, we introduce a new image-based question answering dataset, ChiQA. It contains real-world queries issued by internet users, each combined with several related open-domain images. The system should determine whether an image can answer the question or not. Different from previous VQA datasets, the questions are real-world, image-independent queries that are more diverse and unbiased. Compared with previous image-retrieval or image-captioning datasets, ChiQA measures not only relatedness but also answerability, which demands more fine-grained vision and language reasoning. ChiQA contains more than 40K questions and more than 200K question-image pairs. A three-level 2/1/0 label is assigned to each pair, indicating a perfect answer, a partial answer, or an irrelevant image. Data analysis shows that ChiQA requires a deep understanding of both language and vision, including grounding, comparison, and reading. We evaluate several state-of-the-art vision-language models such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
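Illustrative sketch (not from the paper): the abstract describes each example as a question paired with a candidate image and a three-level 2/1/0 answerability label. The minimal Python sketch below shows one way such records might be represented and scored with plain label accuracy; the JSON-lines layout and the field names `question`, `image_url`, and `label` are assumptions for illustration only and are not specified by the authors.

```python
import json
from dataclasses import dataclass

@dataclass
class ChiQAPair:
    question: str   # real-world query issued by an internet user
    image_url: str  # candidate open-domain image (hypothetical field name)
    label: int      # 2 = perfect answer, 1 = partial answer, 0 = irrelevant

def load_pairs(path: str) -> list[ChiQAPair]:
    """Load question-image pairs from an assumed JSON-lines file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            pairs.append(ChiQAPair(record["question"],
                                    record["image_url"],
                                    int(record["label"])))
    return pairs

def label_accuracy(predictions: list[int], pairs: list[ChiQAPair]) -> float:
    """Fraction of pairs whose predicted 0/1/2 label matches the gold label."""
    correct = sum(pred == pair.label for pred, pair in zip(predictions, pairs))
    return correct / len(pairs)
```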