Paper Title
Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments
Paper Authors
Paper Abstract
Visualizing optimization landscapes has led to many fundamental insights in numeric optimization, and novel improvements to optimization techniques. However, visualizations of the objective that reinforcement learning optimizes (the "reward surface") have only ever been generated for a small number of narrow contexts. This work presents reward surfaces and related visualizations of 27 of the most widely used reinforcement learning environments in Gym for the first time. We also explore reward surfaces in the policy gradient direction and show for the first time that many popular reinforcement learning environments have frequent "cliffs" (sudden large drops in expected return). We demonstrate that A2C often "dives off" these cliffs into low reward regions of the parameter space while PPO avoids them, confirming a popular intuition for PPO's improved performance over previous methods. We additionally introduce a highly extensible library that allows researchers to easily generate these visualizations in the future. Our findings provide new intuition to explain the successes and failures of modern RL methods, and our visualizations concretely characterize several failure modes of reinforcement learning agents in novel ways.
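To make the idea of a "reward surface" concrete, the sketch below shows one common way such a surface can be estimated: fix a base policy, perturb its parameters along two directions, and measure the average episodic return at each grid point. This is only an illustrative sketch, not the authors' library or method; the environment (CartPole-v1), the linear softmax policy, the rollout budget, and the helper names are all assumptions made for the example.

```python
# Minimal sketch of estimating a reward surface over a 2-D slice of parameter space.
# All names and settings here are illustrative assumptions, not the paper's library.
import numpy as np
import gym

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

def rollout_return(theta, episodes=5, max_steps=500):
    """Average undiscounted return of a linear softmax policy with weights theta."""
    total = 0.0
    for _ in range(episodes):
        obs = env.reset()
        obs = obs[0] if isinstance(obs, tuple) else obs  # handle old/new reset APIs
        for _ in range(max_steps):
            logits = obs @ theta
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            action = np.random.choice(n_actions, p=probs)
            step = env.step(action)
            if len(step) == 5:   # gymnasium-style 5-tuple step API
                obs, reward, terminated, truncated, _ = step
                done = terminated or truncated
            else:                # classic gym 4-tuple step API
                obs, reward, done, _ = step
            total += reward
            if done:
                break
    return total / episodes

# Base policy parameters and two random directions in parameter space.
theta0 = np.zeros((obs_dim, n_actions))
d1 = np.random.randn(*theta0.shape)
d2 = np.random.randn(*theta0.shape)

# Estimate expected return on a grid of perturbations theta0 + a*d1 + b*d2.
coords = np.linspace(-1.0, 1.0, 11)
surface = np.array([[rollout_return(theta0 + a * d1 + b * d2)
                     for b in coords] for a in coords])
print(surface.shape)  # (11, 11) grid of estimated returns, ready to plot as a surface
```

In the same spirit, replacing one of the two random directions with an estimated policy gradient direction would correspond to the "reward surface in the policy gradient direction" view described above; the resulting grid can be rendered with any standard surface or contour plotting tool.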