Paper Title
How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation
Paper Authors
Paper Abstract
Reinforcement learning (RL) has been shown to be effective at learning control from experience. However, RL typically requires a large amount of online interaction with the environment. This limits its applicability to real-world settings, such as in robotics, where such interaction is expensive. In this work we investigate ways to minimize online interactions in a target task, by reusing a suboptimal policy we might have access to, for example from training on related prior tasks, or in simulation. To this end, we develop two RL algorithms that can speed up training by using not only the action distributions of teacher policies, but also data collected by such policies on the task at hand. We conduct a thorough experimental study of how to use suboptimal teachers on a challenging robotic manipulation benchmark of vision-based stacking with diverse objects. We compare our methods to offline, online, offline-to-online, and kickstarting RL algorithms. By doing so, we find that training on data from both the teacher and student enables the best performance for limited data budgets. We examine how to best allocate a limited data budget -- on the target task -- between the teacher and the student policy, and report experiments using varying budgets, two teachers with different degrees of suboptimality, and five stacking tasks that require a diverse set of behaviors. Our analysis, both in simulation and in the real world, shows that our approach is the best across data budgets, while standard offline RL from teacher rollouts is surprisingly effective when enough data is given.
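To make the core idea concrete, below is a minimal sketch (not the paper's exact algorithm) of a kickstarting-style policy update: an RL term driven by a critic is combined with a distillation term that keeps the student close to a suboptimal teacher's action distribution, computed on batches drawn from a buffer mixing teacher and student data. The network sizes, the discrete action space, the `kickstart_weight` coefficient, and the `q_net` critic are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a kickstarting-style update on mixed teacher/student data.
import torch
import torch.nn as nn
from torch.distributions import Categorical, kl_divergence

obs_dim, n_actions, kickstart_weight = 32, 6, 1.0  # assumed sizes and loss weight

student = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
teacher = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

def policy_update(obs):
    """One gradient step on observations sampled from a buffer that mixes
    teacher rollouts with the student's own experience."""
    student_dist = Categorical(logits=student(obs))
    with torch.no_grad():
        teacher_dist = Categorical(logits=teacher(obs))   # frozen suboptimal teacher
        q_values = q_net(obs)                              # critic trained separately (e.g., offline RL)
    # RL term: increase the probability of high-Q actions under the student policy.
    rl_loss = -(student_dist.probs * q_values).sum(-1).mean()
    # Kickstarting term: stay close to the teacher's action distribution.
    distill_loss = kl_divergence(teacher_dist, student_dist).mean()
    loss = rl_loss + kickstart_weight * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

batch_obs = torch.randn(128, obs_dim)  # stand-in for a sampled minibatch of observations
policy_update(batch_obs)
```

In practice, the weight on the distillation term would typically be annealed or tuned so that the student can eventually surpass the suboptimal teacher rather than merely imitate it.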