Paper Title
Real-World Robot Learning with Masked Visual Pre-training
Paper Authors
Paper Abstract
In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (by up to 75%), supervised ImageNet pre-training (by up to 81%), and training from scratch (by up to 81%). Finally, we train a 307M-parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and clearly demonstrate the benefits of scaling visual pre-training for robot learning.
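The pipeline the abstract describes, a frozen pre-trained visual encoder whose features feed a small learnable control module, can be sketched as follows. This is a minimal illustrative stand-in, not the paper's actual architecture: the "encoder" here is a fixed random projection in place of an MAE-pre-trained ViT, and the control module is a single linear layer trained on dummy behavior-cloning data. Only the head receives gradient updates; the encoder weights stay untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained visual encoder (e.g. an MAE ViT).
# Its weights are fixed and receive no gradient updates.
D_IMG, D_FEAT, D_ACT = 32, 16, 4
W_enc = rng.normal(size=(D_IMG, D_FEAT))
W_enc_0 = W_enc.copy()  # snapshot, to verify the encoder is never modified

def encode(x):
    """Frozen feature extractor: image vector -> feature vector."""
    return np.tanh(x @ W_enc)

# Learnable control module: one linear layer mapping features to actions.
W_head = np.zeros((D_FEAT, D_ACT))

# Dummy behavior-cloning data: flattened "images" and target actions.
X = rng.normal(size=(128, D_IMG))
A = rng.normal(size=(128, D_ACT))

# Features are computed once; the encoder is not part of the training graph.
feats = encode(X)

losses = []
lr = 0.05
for _ in range(200):
    pred = feats @ W_head
    err = pred - A
    losses.append(float((err ** 2).mean()))
    # Gradient step on the head only; W_enc is untouched.
    W_head -= lr * feats.T @ err / len(X)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Freezing the encoder is what lets a single set of pre-trained weights serve many downstream tasks and embodiments: per task, only the lightweight control head is fit.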