Forgetting and Imbalance in Robot Lifelong Learning with Off-policy Data

Paper title

Forgetting and Imbalance in Robot Lifelong Learning with Off-policy Data

Paper authors

Wenxuan Zhou, Steven Bohez, Jan Humplik, Abbas Abdolmaleki, Dushyant Rao, Markus Wulfmeier, Tuomas Haarnoja, Nicolas Heess

Paper abstract

Robots will experience non-stationary environment dynamics throughout their lifetime: a robot's own dynamics can change due to wear and tear, or its surroundings may change over time. Eventually, the robot should perform well in all of the environment variations it has encountered. At the same time, it should still be able to learn quickly in a new environment. We identify two challenges in Reinforcement Learning (RL) under such a lifelong learning setting with off-policy data. First, existing off-policy algorithms struggle with the trade-off between being conservative to maintain good performance in the old environment and learning efficiently in the new environment, despite keeping all the data in the replay buffer. We propose the Offline Distillation Pipeline to break this trade-off by separating the training procedure into an online interaction phase and an offline distillation phase. Second, we find that training with imbalanced off-policy data from multiple environments across the lifetime causes a significant performance drop. We identify that this performance drop is caused by the combination of imbalanced quality and size among the datasets, which exacerbates the extrapolation error of the Q-function. During the distillation phase, we apply a simple fix to the issue by keeping the policy closer to the behavior policy that generated the data. In the experiments, we demonstrate these two challenges and the proposed solutions with a simulated bipedal robot walking task across various environment changes. We show that the Offline Distillation Pipeline achieves better performance across all the encountered environments without affecting data collection. We also provide a comprehensive empirical study to support our hypothesis on the data imbalance issue.
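
The abstract does not say how the policy is kept close to the behavior policy, so the following is only a minimal sketch of one common way to express that idea, not the authors' method. It assumes a deterministic policy and uses a squared distance between the policy's action and the logged action as the penalty; the function name `behavior_regularized_loss` and the weight `alpha` are hypothetical.

```python
from typing import Sequence


def behavior_regularized_loss(
    q_values: Sequence[float],
    policy_actions: Sequence[Sequence[float]],
    data_actions: Sequence[Sequence[float]],
    alpha: float = 1.0,
) -> float:
    """Batch mean of -Q(s, pi(s)) + alpha * ||pi(s) - a||^2.

    q_values[i]       -- critic estimate Q(s_i, pi(s_i)) (assumed given)
    policy_actions[i] -- action pi(s_i) proposed by the current policy
    data_actions[i]   -- action a_i stored in the replay buffer
    alpha             -- hypothetical weight of the behavior penalty
    """
    if not (len(q_values) == len(policy_actions) == len(data_actions)):
        raise ValueError("all inputs must have the same batch size")
    if len(q_values) == 0:
        raise ValueError("batch must not be empty")

    batch_size = len(q_values)
    # Maximizing Q is written as minimizing its negative.
    value_term = -sum(q_values) / batch_size
    # Squared distance to the logged action stands in for
    # "staying close to the behavior policy" (an assumption).
    distance_term = sum(
        sum((p - d) ** 2 for p, d in zip(pa, da))
        for pa, da in zip(policy_actions, data_actions)
    ) / batch_size
    return value_term + alpha * distance_term
```

With `alpha = 0` the loss reduces to plain Q-maximization, which is where extrapolation error in the Q-function can be exploited; larger values of `alpha` pull the policy toward actions that actually appear in the data.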
