Title
Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene
Authors
Abstract
Visual dialog has witnessed great progress since various vision-oriented goals were introduced into the conversation, notably GuessWhich and GuessWhat, where a single image is visible to only one of the two agents or to both the questioner and the answerer, respectively. Researchers have explored visual dialog tasks mostly in such single- or perfectly co-observable visual scenes, while somewhat neglecting non-perfectly co-observable visual scenes, where the images accessed by the two agents may not be exactly the same, as often occurs in practice. Although building common ground in a non-perfectly co-observable visual scene through conversation is significant for advanced dialog agents, the lack of such a dialog task and a corresponding large-scale dataset has made in-depth research impossible. To break this limitation, we propose an object-referring game in a non-perfectly co-observable visual scene, where the goal is to spot the differences between similar visual scenes by conversing in natural language. The task addresses the challenges of dialog strategy in a non-perfectly co-observable visual scene and of the ability to categorize objects. Correspondingly, we construct a large-scale multimodal dataset, named SpotDiff, which contains 87k Virtual Reality images and 97k dialogs generated by self-play. Finally, we provide benchmark models for this task, and conduct extensive experiments to evaluate their performance and analyze the main challenges.