Paper Title

An operator view of policy gradient methods

Paper Authors

Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux

Paper Abstract

We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $π$ to a better one $\mathcal{I}π$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}π$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.
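The abstract casts one policy gradient step as the composition $\pi_{k+1} = \mathcal{P}\,\mathcal{I}\,\pi_k$: first improve the current policy, then project the improved policy back onto the set of realizable (e.g., parametric) policies. The sketch below illustrates such an improve-then-project loop on a toy single-state problem. The concrete choices made here, an exponentiated value re-weighting for $\mathcal{I}$ and a KL projection onto softmax policies for $\mathcal{P}$, are illustrative assumptions rather than the paper's exact operators, and the function names (`improvement_operator`, `projection_operator`) are hypothetical.

```python
# Minimal, hypothetical sketch of the improve-then-project iteration
# pi_{k+1} = P(I(pi_k)) on a single-state (bandit) problem.
# The specific forms of I and P below are illustrative assumptions,
# not the operators defined in the paper.

import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
q_values = rng.normal(size=n_actions)      # stand-in for the action values Q^pi(s, .)


def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()


def improvement_operator(pi, q):
    """I: re-weight pi by exponentiated action values, then renormalize (assumed form)."""
    w = pi * np.exp(q)
    return w / w.sum()


def projection_operator(target, logits, lr=0.5, steps=200):
    """P: find the softmax policy closest to `target` by minimizing KL(target || softmax(logits))."""
    for _ in range(steps):
        pi = softmax(logits)
        grad = pi - target                 # gradient of KL(target || softmax(logits)) w.r.t. logits
        logits = logits - lr * grad
    return logits


logits = np.zeros(n_actions)               # start from the uniform policy
for k in range(10):
    pi = softmax(logits)                   # current realizable policy pi_k
    improved = improvement_operator(pi, q_values)   # I pi_k
    logits = projection_operator(improved, logits)  # pi_{k+1} = P(I pi_k)

print("final policy:", softmax(logits))
print("best action :", q_values.argmax())
```

In this toy setting the iteration concentrates the policy on the highest-value action. The paper's contribution is to identify which choices of $\mathcal{I}$ and $\mathcal{P}$ recover REINFORCE and PPO, and to use the decomposition to derive a global lower bound on the expected return.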
