Title
A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning
Authors
Abstract
We propose a novel hybrid stochastic policy gradient estimator by combining an unbiased policy gradient estimator, the REINFORCE estimator, with a biased one, an adapted SARAH estimator, for policy optimization. The hybrid policy gradient estimator is shown to be biased, but it has a variance-reduced property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem, which allows us to handle constraints or regularizers on the policy parameters. We first propose a single-loop algorithm and then introduce a more practical restarting variant. We prove that both algorithms can achieve the best-known trajectory complexity $\mathcal{O}\left(\varepsilon^{-3}\right)$ to attain a first-order stationary point of the composite problem, which improves on the existing $\mathcal{O}\left(\varepsilon^{-4}\right)$ complexity of REINFORCE/GPOMDP and the $\mathcal{O}\left(\varepsilon^{-10/3}\right)$ complexity of SVRPG in the non-composite setting. We evaluate the performance of our algorithm on several well-known examples in reinforcement learning. Numerical results show that our algorithm outperforms two existing methods on these examples. Moreover, the composite setting indeed has some advantages over the non-composite one on certain problems.
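To make the construction concrete, the following is a rough schematic of a hybrid update of this kind, written in generic notation that is not taken from the paper: $g(\tau \mid \theta)$ stands for a REINFORCE/GPOMDP-style gradient estimate computed from a sampled trajectory $\tau$, $\omega(\tau;\theta_{t-1},\theta_t)$ for an importance weight correcting for the change of trajectory distribution between consecutive iterates, $\beta \in [0,1]$ for the mixing weight, $b$ and $\hat b$ for mini-batch sizes, $\eta$ for the step size, and $\mathcal{R}$ for the regularizer in the composite objective. The exact estimator, weights, and parameter choices are those defined in the paper, not this sketch.
\[
  v_t \;=\; \beta\Big[\, v_{t-1} \;+\; \tfrac{1}{b}\sum_{i=1}^{b}\big(\, g(\tau_i \mid \theta_t) \;-\; \omega(\tau_i;\theta_{t-1},\theta_t)\, g(\tau_i \mid \theta_{t-1}) \,\big) \Big]
  \;+\; (1-\beta)\,\tfrac{1}{\hat b}\sum_{j=1}^{\hat b} g(\hat\tau_j \mid \theta_t),
  \qquad
  \theta_{t+1} \;=\; \mathrm{prox}_{\eta \mathcal{R}}\!\big(\theta_t + \eta\, v_t\big).
\]
In this schematic, $\beta = 0$ reduces to a plain unbiased REINFORCE/GPOMDP step, while $\beta = 1$ reduces to a SARAH-type recursive step, which is the sense in which the hybrid estimator interpolates between the unbiased and the biased estimators.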