Paper Title
Masked Visual Pre-training for Motor Control
Paper Authors
Paper Abstract
This paper shows that self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels. We first train the visual representations by masked modeling of natural images. We then freeze the visual encoder and train neural network controllers on top with reinforcement learning. We do not perform any task-specific fine-tuning of the encoder; the same visual representations are used for all motor control tasks. To the best of our knowledge, this is the first self-supervised model to exploit real-world images at scale for motor control. To accelerate progress in learning from pixels, we contribute a benchmark suite of hand-designed tasks varying in movements, scenes, and robots. Without relying on labels, state-estimation, or expert demonstrations, we consistently outperform supervised encoders by up to 80% absolute success rate, sometimes even matching the oracle state performance. We also find that in-the-wild images, e.g., from YouTube or Egocentric videos, lead to better visual representations for various manipulation tasks than ImageNet images.
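The pipeline the abstract describes, pre-training a visual encoder with masked image modeling, freezing it, and training a small controller on top with reinforcement learning, can be summarized in a minimal PyTorch sketch. This is an illustration of the frozen-encoder idea only, not the paper's released code; the class name `FrozenEncoderPolicy`, the MLP controller shape, and the commented loader are all hypothetical.

```python
import torch
import torch.nn as nn

class FrozenEncoderPolicy(nn.Module):
    """Hypothetical sketch: a frozen pre-trained visual encoder with a
    small controller network trained on top via reinforcement learning."""

    def __init__(self, encoder: nn.Module, feat_dim: int, act_dim: int):
        super().__init__()
        self.encoder = encoder
        # Freeze the encoder: no task-specific fine-tuning, the same
        # visual representation is reused for all motor control tasks.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.encoder.eval()
        # Small MLP controller; only these parameters are trained.
        self.controller = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # The encoder stays fixed during RL training.
        with torch.no_grad():
            feats = self.encoder(pixels)
        return self.controller(feats)

# Usage sketch (loader name is a placeholder, not a real API):
# encoder = load_pretrained_mae_encoder()
# policy = FrozenEncoderPolicy(encoder, feat_dim=768, act_dim=7)
# optimizer = torch.optim.Adam(policy.controller.parameters(), lr=3e-4)
```

Because only `self.controller` receives gradients, the optimizer is given just the controller's parameters; the frozen encoder acts as a fixed feature extractor shared across tasks.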