Paper Title

Revisiting Design Choices in Proximal Policy Optimization

Paper Authors

Hsu, Chloe Ching-Yun, Mendler-Dünner, Celestine, Hardt, Moritz

Paper Abstract

Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm. In standard implementations, PPO regularizes policy updates with clipped probability ratios, and parameterizes policies with either continuous Gaussian distributions or discrete Softmax distributions. These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchmarks. We revisit these practices outside the regime of current benchmarks, and expose three failure modes of standard PPO. We explain why standard design choices are problematic in these cases, and show that alternative choices of surrogate objectives and policy parameterizations can prevent the failure modes. We hope that our work serves as a reminder that many algorithmic design choices in reinforcement learning are tied to specific simulation environments. We should not implicitly accept these choices as a standard part of a more general algorithm.
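For readers unfamiliar with the design choices the abstract refers to, the following is a minimal sketch of standard PPO's clipped-ratio surrogate objective and the usual Gaussian/Softmax policy parameterizations (as introduced in Schulman et al., 2017). It is not the paper's proposed alternative; the tensor names and the clip range epsilon = 0.2 are illustrative assumptions.

```python
# Sketch of the standard PPO design choices discussed in the abstract:
# clipped probability-ratio surrogate + Gaussian/Softmax policy heads.
# Not the paper's alternative objective or parameterization.
import torch
from torch.distributions import Normal, Categorical


def ppo_clipped_loss(log_prob_new: torch.Tensor,
                     log_prob_old: torch.Tensor,
                     advantages: torch.Tensor,
                     epsilon: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over a batch."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Pessimistic element-wise minimum, negated for gradient descent
    return -torch.min(unclipped, clipped).mean()


# Standard policy parameterizations referenced in the abstract:
# continuous actions -> diagonal Gaussian; discrete actions -> Softmax.
mean, log_std = torch.zeros(4), torch.zeros(4)   # placeholder network outputs
continuous_policy = Normal(mean, log_std.exp())
logits = torch.zeros(6)                          # placeholder network outputs
discrete_policy = Categorical(logits=logits)
```

The paper's contribution is to show settings outside the MuJoCo/Atari benchmark regime where these defaults fail, and to demonstrate that alternative surrogate objectives and policy parameterizations avoid those failure modes.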
