Paper Title

Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

Paper Authors

Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, Jonathan Tompson

Paper Abstract

In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that was either collected with specific tasks in mind or expensively re-labeled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) like CLIP or ViLD have been applied to robotics for learning representations and scene descriptors. Can these pretrained models serve as automatic labelers for robot data, effectively importing Internet-scale knowledge into existing datasets to make them useful even for tasks that are not reflected in their ground truth annotations? To accomplish this, we introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL): we utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabeled demonstration data and then train language-conditioned policies on the augmented datasets. This method enables cheaper acquisition of useful language descriptions compared to expensive human labels, allowing for more efficient label coverage of large-scale datasets. We apply DIAL to a challenging real-world robotic manipulation domain where 96.5% of the 80,000 demonstrations do not contain crowd-sourced language annotations. DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
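The relabeling idea in the abstract can be made concrete with a short sketch. Below, an off-the-shelf CLIP model scores a pool of candidate instructions against a frame from an unlabeled demonstration, and the highest-scoring instructions are kept as pseudo-labels for training a language-conditioned policy. This is a minimal illustration, not the paper's implementation: the checkpoint name, candidate pool, file path, and single-frame scoring are all assumptions, and DIAL's full pipeline additionally adapts CLIP on the small crowd-annotated subset before relabeling the remaining 96.5% of episodes.

# Minimal sketch of VLM-based instruction relabeling in the spirit of DIAL,
# using an off-the-shelf Hugging Face CLIP checkpoint. Checkpoint name,
# candidate instructions, and single-frame scoring are illustrative
# assumptions, not the paper's actual pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pseudo_label(frame, candidates, top_k=3):
    """Score candidate instructions against one episode frame with CLIP
    and return the top_k (instruction, probability) pairs."""
    inputs = processor(text=candidates, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # shape: (len(candidates),)
    probs = logits.softmax(dim=-1)
    top = probs.topk(min(top_k, len(candidates)))
    return [(candidates[i], probs[i].item()) for i in top.indices.tolist()]

# Hypothetical usage: attach pseudo-labels to one unlabeled demonstration
# by scoring its final frame against a pool of candidate instructions.
frame = Image.open("episode_0042_final_frame.png")  # placeholder path
candidates = [
    "pick up the coke can",
    "open the top drawer",
    "push the apple to the left edge of the table",
    "place the sponge in the bowl",
]
for instruction, prob in pseudo_label(frame, candidates):
    print(f"{prob:.3f}  {instruction}")

In this sketch the pseudo-labels are simply the top-scoring candidates; the policy would then be trained by conditioning on these instructions alongside the original crowd-sourced annotations.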
