Title

PBCS: Efficient Exploration and Exploitation Using a Synergy between Reinforcement Learning and Motion Planning

Authors

Guillaume Matheron, Nicolas Perrin, Olivier Sigaud

Abstract

The exploration-exploitation trade-off is at the heart of reinforcement learning (RL). However, most continuous control benchmarks used in recent RL research only require local exploration. This led to the development of algorithms that have basic exploration capabilities, and behave poorly in benchmarks that require more versatile exploration. For instance, as demonstrated in our empirical study, state-of-the-art RL algorithms such as DDPG and TD3 are unable to steer a point mass in even small 2D mazes. In this paper, we propose a new algorithm called "Plan, Backplay, Chain Skills" (PBCS) that combines motion planning and reinforcement learning to solve hard exploration environments. In a first phase, a motion planning algorithm is used to find a single good trajectory, then an RL algorithm is trained using a curriculum derived from the trajectory, by combining a variant of the Backplay algorithm and skill chaining. We show that this method outperforms state-of-the-art RL algorithms in 2D maze environments of various sizes, and is able to improve on the trajectory obtained by the motion planning phase.
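
The abstract only describes the two phases at a high level. Below is a minimal, self-contained Python sketch of that outline under a toy point-mass setting; every name in it (plan_single_trajectory, train_policy_from, pbcs_sketch, the fixed skill horizon) is a hypothetical stand-in for illustration, not the authors' implementation. The sketch shows phase 1 producing a single trajectory from a "motion planner", and phase 2 training skills with a Backplay-style reverse curriculum along that trajectory, chaining them from the goal back to the start.

```python
# Hedged sketch of the PBCS outline from the abstract. All functions are
# hypothetical stand-ins: the real method uses a sampling-based motion planner
# (phase 1) and an off-policy RL algorithm such as DDPG/TD3 trained with a
# Backplay-style reverse curriculum plus skill chaining (phase 2).

from typing import Callable, List, Tuple

State = Tuple[float, float]            # e.g. (x, y) position of a point mass
Trajectory = List[State]
Policy = Callable[[State], Tuple[float, float]]


def plan_single_trajectory(start: State, goal: State, n_steps: int = 50) -> Trajectory:
    """Phase 1 stand-in: return one feasible state sequence from start to goal.
    Here we simply interpolate linearly; the paper uses a motion planner."""
    return [
        (start[0] + (goal[0] - start[0]) * t / n_steps,
         start[1] + (goal[1] - start[1]) * t / n_steps)
        for t in range(n_steps + 1)
    ]


def train_policy_from(start: State, goal: State) -> Policy:
    """Stand-in for training an RL agent whose episodes reset to `start`.
    Returns a trivial 'move toward goal' policy instead of a learned one."""
    def policy(s: State) -> Tuple[float, float]:
        return (goal[0] - s[0], goal[1] - s[1])
    return policy


def pbcs_sketch(start: State, goal: State) -> List[Policy]:
    """Phase 2 sketch: walk backward along the planned trajectory
    (Backplay-style curriculum) and chain one skill per backward span."""
    trajectory = plan_single_trajectory(start, goal)
    skills: List[Policy] = []
    subgoal = goal
    max_backplay_span = 20             # assumed skill horizon, not from the paper
    i = len(trajectory) - 1
    while i > 0:
        i = max(0, i - max_backplay_span)          # earlier start state
        skills.append(train_policy_from(trajectory[i], subgoal))
        subgoal = trajectory[i]                    # next skill must reach this state
    return list(reversed(skills))                  # execute skills start -> goal


if __name__ == "__main__":
    chain = pbcs_sketch(start=(0.0, 0.0), goal=(1.0, 1.0))
    print(f"Learned a chain of {len(chain)} skills")
```

In the paper's actual setting, each training call would correspond to running the RL algorithm from reset states moved progressively earlier along the planned trajectory until it reliably reaches the subgoal; the fixed backward span used above is purely illustrative.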
