Paper Title

Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension

Authors

Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, Qi Wu

Abstract

Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring expression datasets, however, fail to provide an ideal test bed for evaluating the reasoning ability of the models, mainly because 1) their expressions typically describe only some simple distinctive properties of the object and 2) their images contain limited distracting information. To bridge the gap, we propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features. First, we design a novel expression engine rendering various reasoning logics that can be flexibly combined with rich visual properties to generate expressions with varying compositionality. Second, to better exploit the full reasoning chain embodied in an expression, we propose a new test setting by adding additional distracting images containing objects sharing similar properties with the referent, thus minimising the success rate of reasoning-free cross-domain alignment. We evaluate several state-of-the-art REF models, but find none of them can achieve promising performance. A proposed modular hard mining strategy performs the best but still leaves substantial room for improvement. We hope this new dataset and task can serve as a benchmark for deeper visual reasoning analysis and foster the research on referring expression comprehension.
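The distractor-augmented test setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the data layout, the `evaluate` function, and the `toy_score` word-overlap scorer are all hypothetical stand-ins for a learned region-expression matching model. The point it demonstrates is that a prediction only counts as correct when the top-scoring region over the target image *and* all distractor images is the true referent, so a model cannot succeed by shallow text-region alignment alone.

```python
# Hedged sketch of the distractor-augmented evaluation protocol.
# All names below (evaluate, toy_score, region dicts) are illustrative
# assumptions, not the Cops-Ref reference implementation.

def evaluate(expression, images, score_fn, referent_id):
    """Return True iff the highest-scoring region across ALL images
    (target + distractors) is the ground-truth referent."""
    best_id, best_score = None, float("-inf")
    for image in images:
        for region in image["regions"]:
            s = score_fn(expression, region)
            if s > best_score:
                best_id, best_score = region["id"], s
    return best_id == referent_id

# Toy data: the target image contains the referent (r1); the distractor
# image contains an object sharing some of its attributes (a red cube).
target = {"regions": [{"id": "r1", "label": "red cube left of sphere"},
                      {"id": "r2", "label": "blue cube"}]}
distractor = {"regions": [{"id": "r3", "label": "red cube"}]}

def toy_score(expr, region):
    # Word overlap as a crude stand-in for a learned matching score.
    return len(set(expr.split()) & set(region["label"].split()))

# The full compositional expression disambiguates the referent even
# in the presence of the attribute-sharing distractor object.
print(evaluate("red cube left of sphere", [target, distractor],
               toy_score, referent_id="r1"))
```

In this toy run the relational clause "left of sphere" is what separates the referent from the distractor's red cube; a scorer that only matched "red cube" would be ambiguous between them, which is exactly the failure mode the added distractor images are designed to expose.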
