Paper Title

SQA3D: Situated Question Answering in 3D Scenes

Authors

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang

Abstract

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal models, especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one achieves an overall score of only 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.
