Paper Title

Stable Policy Optimization via Off-Policy Divergence Regularization

Paper Authors

Ahmed Touati, Amy Zhang, Joelle Pineau, Pascal Vincent

Paper Abstract

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
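
For concreteness, the kind of proximally regularized policy-improvement step the abstract describes can be sketched as follows; the trade-off coefficient \(\lambda\), the advantage \(A^{\pi_k}\), and the generic divergence \(D\) are placeholder notation, not necessarily the paper's exact formulation:

\[
\pi_{k+1} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{(s,a)\sim d_{\pi_k}}\!\left[\frac{\pi(a\mid s)}{\pi_k(a\mid s)}\,A^{\pi_k}(s,a)\right]
\;-\; \lambda\, D\!\left(d_{\pi}\,\big\|\,d_{\pi_k}\right),
\]
\[
d_{\pi}(s,a) \;=\; (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr(s_t=s,\,a_t=a\mid\pi),
\]

where \(d_{\pi}\) is the discounted state-action visitation distribution induced by policy \(\pi\). Per the abstract, the divergence term \(D(d_{\pi}\,\|\,d_{\pi_k})\) is not computed in closed form but estimated from off-policy data in an adversarial manner.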
