Paper Title
Efficient Deep Reinforcement Learning with Predictive Processing Proximal Policy Optimization
Paper Authors
Paper Abstract
Advances in reinforcement learning (RL) often rely on massive compute resources and remain notoriously sample inefficient. In contrast, the human brain is able to efficiently learn effective control strategies using limited resources. This raises the question of whether insights from neuroscience can be used to improve current RL methods. Predictive processing is a popular theoretical framework which maintains that the human brain is actively seeking to minimize surprise. We show that recurrent neural networks which predict their own sensory states can be leveraged to minimize surprise, yielding substantial gains in cumulative reward. Specifically, we present the Predictive Processing Proximal Policy Optimization (P4O) agent: an actor-critic reinforcement learning agent that applies predictive processing to a recurrent variant of the PPO algorithm by integrating a world model into its hidden state. Even without hyperparameter tuning, P4O significantly outperforms a baseline recurrent variant of the PPO algorithm on multiple Atari games using a single GPU. It also outperforms other state-of-the-art agents given the same wall-clock time and exceeds human gamer performance on multiple games, including Seaquest, which is a particularly challenging environment in the Atari domain. Altogether, our work underscores how insights from the field of neuroscience may support the development of more capable and efficient artificial agents.
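To make the described architecture concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of a recurrent actor-critic whose hidden state also feeds a predictive head, with an auxiliary loss that penalizes next-observation prediction error (surprise) alongside the standard PPO terms. Names such as P4OAgent, obs_dim, and pred_coef are illustrative assumptions and are not taken from the paper.

# Minimal sketch (not the authors' released code) of a recurrent actor-critic
# with a predictive "world model" head on its hidden state. The auxiliary term
# in p4o_loss minimizes sensory surprise (next-observation prediction error)
# in addition to the usual PPO policy, value, and entropy terms.
import torch
import torch.nn as nn

class P4OAgent(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)      # recurrent core
        self.actor = nn.Linear(hidden_dim, n_actions)      # policy logits
        self.critic = nn.Linear(hidden_dim, 1)              # state-value estimate
        self.predictor = nn.Linear(hidden_dim, obs_dim)     # predicts next observation

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        h = self.rnn(self.encoder(obs), h)
        return self.actor(h), self.critic(h), self.predictor(h), h

def p4o_loss(policy_loss, value_loss, entropy, pred_next_obs, next_obs,
             value_coef=0.5, entropy_coef=0.01, pred_coef=1.0):
    """Combine standard PPO terms with a surprise-minimization term (assumed weighting)."""
    surprise = nn.functional.mse_loss(pred_next_obs, next_obs)  # prediction error
    return policy_loss + value_coef * value_loss - entropy_coef * entropy + pred_coef * surprise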