Paper Title

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Paper Authors

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, Dhruv Batra

Paper Abstract

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2%-20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "find a kitchen sink") and when the target room can be inferred (e.g., "find a sink and a stove").
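To make the shared-embedding idea concrete, below is a minimal sketch of encoding both goal images (used during ImageNav training) and free-form language goals (used at zero-shot test time) into one semantic space. It assumes a CLIP-style image-text model via Hugging Face transformers as a stand-in for the paper's encoder; the function names and the policy comment are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: image goals and language goals projected into one embedding
# space, so the same navigation policy can consume either. Assumes a
# CLIP-style model; not the authors' exact pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def embed_image_goal(image: Image.Image) -> torch.Tensor:
    # ImageNav training: encode the goal image into the semantic space.
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

@torch.no_grad()
def embed_language_goal(text: str) -> torch.Tensor:
    # Zero-shot ObjectNav: encode a free-form description, e.g. "a sink".
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# The SemanticNav policy only ever sees a goal embedding, so an agent
# trained on embed_image_goal(goal_image) can be driven at test time by
# embed_language_goal("a bathroom sink") with no ObjectNav supervision.
```

Because both encoders land in the same space, swapping the goal modality at test time requires no retraining; this is what makes the approach zero-shot with respect to ObjectNav.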
