Paper Title

Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning for Hanabi

Paper Authors

Bram Grooten, Jelle Wemmenhove, Maurice Poot, Jim Portegies

Paper Abstract

In pursuit of enhanced multi-agent collaboration, we analyze several on-policy deep reinforcement learning algorithms in the recently published Hanabi benchmark. Our research suggests a perhaps counter-intuitive finding, where Proximal Policy Optimization (PPO) is outperformed by Vanilla Policy Gradient over multiple random seeds in a simplified environment of the multi-agent cooperative card game. In our analysis of this behavior we look into Hanabi-specific metrics and hypothesize a reason for PPO's plateau. In addition, we provide proofs for the maximum length of a perfect game (71 turns) and any game (89 turns). Our code can be found at: https://github.com/bramgrooten/DeepRL-for-Hanabi
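
The abstract contrasts Vanilla Policy Gradient (VPG) with Proximal Policy Optimization (PPO). For reference, the two methods differ mainly in their policy loss: VPG maximizes the log-probability-weighted advantage directly, while PPO clips the probability ratio against the old policy to limit update size. The sketch below is a minimal, generic illustration of these standard objectives (in PyTorch), not the implementation from the paper's repository; the function and argument names are illustrative.

```python
import torch

def vpg_loss(logp, adv):
    """Vanilla Policy Gradient objective: -E[ log pi(a|s) * A(s,a) ]."""
    return -(logp * adv).mean()

def ppo_clip_loss(logp, logp_old, adv, clip_eps=0.2):
    """PPO clipped surrogate: bounds how far the new policy moves from the old one."""
    ratio = torch.exp(logp - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```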
