Paper Title

Learning Value Functions in Deep Policy Gradients using Residual Variance

Authors

Yannis Flet-Berliac, Reda Ouhamma, Odalric-Ambrym Maillard, Philippe Preux

Abstract

Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights.
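
The key change the abstract describes is the critic's training objective: instead of regressing absolute values with a mean-squared error, the critic minimizes the residual variance between its predictions and the value targets, i.e. it fits values relative to their mean. The snippet below is only a minimal sketch of that idea in PyTorch, not the authors' implementation; the names `values`, `targets`, `value_net`, `states`, and `returns` are illustrative, and the value targets are assumed to be computed by the surrounding actor-critic code.

```python
import torch


def residual_variance_critic_loss(values: torch.Tensor,
                                  targets: torch.Tensor) -> torch.Tensor:
    """Sketch of a residual-variance critic objective.

    Standard actor-critic minimizes E[(V_hat - V_target)^2].
    Here the critic instead minimizes Var[V_target - V_hat], so it learns
    the value of states relative to their mean rather than their
    absolute level.
    """
    residuals = targets - values          # one residual per sampled state
    return residuals.var(unbiased=False)  # Var = E[(r - E[r])^2]


# Hypothetical usage (value_net, states and returns are assumed to exist
# in the surrounding training loop):
# loss = residual_variance_critic_loss(value_net(states).squeeze(-1), returns)
# loss.backward()
```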
