Paper Title


Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning

Paper Authors

Kyra Ahrens, Matthias Kerzel, Jae Hee Lee, Cornelius Weber, Stefan Wermter

Paper Abstract


Spatial reasoning poses a particular challenge for intelligent agents and is at the same time a prerequisite for their successful interaction and communication in the physical world. One such reasoning task is to describe the position of a target object with respect to the intrinsic orientation of some reference object via relative directions. In this paper, we introduce GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on abstract objects. Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions. At the same time, model training requires considerably fewer computational resources compared with existing datasets, yet yields a comparable or even higher performance. Along with the new dataset, we provide a thorough evaluation based on two widely known end-to-end VQA architectures trained on GRiD-A-3D. We demonstrate that within a few epochs, the subtasks required to reason over relative directions, such as recognizing and locating objects in a scene and estimating their intrinsic orientations, are learned in the order in which relative directions are intuitively processed.
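To make the core spatial relation concrete, the sketch below illustrates one common way to compute a relative direction such as "left of" or "behind" with respect to a reference object's intrinsic orientation: express the target's offset in the reference object's local frame and pick the dominant axis. This is a minimal, assumed formulation for illustration only (2D positions, yaw in radians, hypothetical function name); it is not the annotation procedure used by GRiD-A-3D.

```python
import math

def relative_direction(ref_pos, ref_yaw, target_pos):
    """Classify where `target_pos` lies relative to a reference object
    at `ref_pos` facing along `ref_yaw` (world-frame yaw, radians).

    Illustrative sketch only: GRiD-A-3D's actual labeling scheme is not
    reproduced here. Returns one of "in front of", "behind",
    "left of", "right of", chosen by the dominant local-frame axis.
    """
    dx = target_pos[0] - ref_pos[0]
    dy = target_pos[1] - ref_pos[1]
    # Rotate the world-frame offset into the reference object's intrinsic frame.
    forward = dx * math.cos(ref_yaw) + dy * math.sin(ref_yaw)
    left = -dx * math.sin(ref_yaw) + dy * math.cos(ref_yaw)
    if abs(forward) >= abs(left):
        return "in front of" if forward > 0 else "behind"
    return "left of" if left > 0 else "right of"

# Example: a target one unit to the intrinsic right of a reference facing +y.
print(relative_direction((0.0, 0.0), math.pi / 2, (1.0, 0.0)))  # -> "right of"
```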
