Paper Title
Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning
Paper Authors
Paper Abstract
No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model.
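The core idea the abstract describes is a policy that trades off reward maximization against staying close to a human imitation-learned (anchor) policy. The sketch below illustrates only that KL-regularization principle in its standard closed form (the maximizer of E_pi[Q] - lambda * KL(pi || anchor) is proportional to anchor(a) * exp(Q(a) / lambda)); it is not the authors' DiL-piKL or RL-DiL-piKL algorithm, and the action values, anchor policy, and lambda values are illustrative assumptions.

```python
import numpy as np

def kl_regularized_policy(q_values, anchor_policy, lam):
    """Closed-form maximizer of E_pi[Q] - lam * KL(pi || anchor):
    pi(a) is proportional to anchor(a) * exp(Q(a) / lam)."""
    logits = np.log(anchor_policy) + q_values / lam
    logits -= logits.max()               # subtract max for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy example: three actions, an imitation-learned anchor, and estimated rewards.
q = np.array([1.0, 0.2, -0.5])           # hypothetical action values
anchor = np.array([0.2, 0.7, 0.1])        # hypothetical human imitation policy
for lam in (0.1, 1.0, 10.0):              # small lam -> reward-seeking, large lam -> human-like
    print(lam, kl_regularized_policy(q, anchor, lam).round(3))
```

With a small regularization weight the resulting policy concentrates on the highest-value action; with a large weight it stays near the human anchor, matching the trade-off the abstract attributes to regularizing a reward-maximizing policy toward a human imitation-learned policy.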