Title

A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning

Authors

Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Jason Baldridge, Zarana Parekh

Abstract

Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities.
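The NDTW scores reported above are normalized Dynamic Time Warping values, which measure how closely the agent's path tracks the reference trajectory. A minimal sketch of the standard definition follows (nDTW = exp(-DTW(R, Q) / (|R| · d_th))); the Euclidean point distance and the threshold value `d_th = 3.0` are assumptions here, not details taken from this paper:

```python
import math

def dtw(ref, pred):
    """Classic dynamic-time-warping cost between two point sequences."""
    n, m = len(ref), len(pred)
    INF = float("inf")
    # cost[i][j]: minimal cumulative cost aligning ref[:i] with pred[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], pred[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a reference point
                                 cost[i][j - 1],      # skip a predicted point
                                 cost[i - 1][j - 1])  # match the pair
    return cost[n][m]

def ndtw(ref, pred, d_th=3.0):
    """Normalized DTW in (0, 1]; 1.0 means the paths coincide exactly."""
    return math.exp(-dtw(ref, pred) / (len(ref) * d_th))
```

For example, a predicted path identical to the reference scores 1.0, and the score decays smoothly toward 0 as the paths diverge, which is why the paper reports NDTW on a 0-100 scale (e.g. 71.1 → 79.1).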
