Paper Title

Stable Policy Optimization via Off-Policy Divergence Regularization

Paper Authors

Ahmed Touati, Amy Zhang, Joelle Pineau, Pascal Vincent

Paper Abstract

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
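
For concreteness, the kind of proximally regularized policy-improvement step the abstract describes can be sketched as follows; the trade-off coefficient \(\lambda\), the advantage \(A^{\pi_k}\), and the generic divergence \(D\) are placeholder notation, not necessarily the paper's exact formulation:

\[
\pi_{k+1} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{(s,a)\sim d_{\pi_k}}\!\left[\frac{\pi(a\mid s)}{\pi_k(a\mid s)}\,A^{\pi_k}(s,a)\right]
\;-\; \lambda\, D\!\left(d_{\pi}\,\big\|\,d_{\pi_k}\right),
\]
\[
d_{\pi}(s,a) \;=\; (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr(s_t=s,\,a_t=a\mid\pi),
\]

where \(d_{\pi}\) is the discounted state-action visitation distribution induced by policy \(\pi\). Per the abstract, the divergence term \(D(d_{\pi}\,\|\,d_{\pi_k})\) is not computed in closed form but estimated from off-policy data in an adversarial manner.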
