Paper Title
Sample-Efficient Reinforcement Learning of Partially Observable Markov Games
Paper Authors
Paper Abstract
This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions, which reveal incomplete information about the underlying state of the system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when compared against the optimal maximin policies. To the best of our knowledge, this work provides the first line of sample-efficient results for learning POMGs.
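As a rough illustration of the "optimism plus MLE" recipe mentioned in the abstract, below is a minimal, hypothetical sketch in a toy single-agent partially observable setting with a small finite candidate model class and open-loop policies. The candidate set, policy class, confidence width `beta`, and all function names are illustrative assumptions for this sketch; they are not the paper's OMLE algorithm, which handles multi-agent POMGs and far more general model and policy classes.

```python
# Hypothetical toy sketch of an optimistic-MLE loop (not the paper's algorithm).
import itertools
import numpy as np

rng = np.random.default_rng(0)

H = 3               # horizon
S, A, O = 2, 2, 2   # number of states, actions, observations

def random_model(seed):
    """A candidate model: initial dist mu, transitions T[s, a] -> s', emissions E[s] -> o, rewards R[s, a]."""
    r = np.random.default_rng(seed)
    mu = r.dirichlet(np.ones(S))
    T = r.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
    E = r.dirichlet(np.ones(O), size=S)        # shape (S, O)
    R = r.random((S, A))                       # rewards assumed known, for simplicity
    return mu, T, E, R

true_model = random_model(42)
# Finite model class that contains the true model (realizability assumption).
candidates = [true_model] + [random_model(i) for i in range(8)]
# Tiny policy class: open-loop action sequences, for brevity.
policies = list(itertools.product(range(A), repeat=H))

def rollout(model, policy):
    """Sample one trajectory of (observation, action) pairs and its return."""
    mu, T, E, R = model
    s = rng.choice(S, p=mu)
    traj, total = [], 0.0
    for h in range(H):
        o = rng.choice(O, p=E[s])
        a = policy[h]
        traj.append((o, a))
        total += R[s, a]
        s = rng.choice(S, p=T[s, a])
    return traj, total

def log_likelihood(model, traj):
    """log P(observations | actions) under `model`, via forward filtering."""
    mu, T, E, _ = model
    b, ll = mu.copy(), 0.0
    for (o, a) in traj:
        p = float(b @ E[:, o])            # predictive probability of this observation
        ll += np.log(p + 1e-12)
        b = (b * E[:, o]) @ T[:, a, :]    # posterior belief pushed through the transition
        b /= b.sum() + 1e-12
    return ll

def value(model, policy, n=100):
    """Monte-Carlo estimate of the policy's expected return under `model`."""
    return float(np.mean([rollout(model, policy)[1] for _ in range(n)]))

beta = 5.0   # confidence-set width; chosen by the theory in the paper, a fixed constant here
data = []
for k in range(20):
    # (1) MLE confidence set: candidates whose total log-likelihood on past
    #     trajectories is within beta of the maximum.
    lls = np.array([sum(log_likelihood(m, t) for t in data) for m in candidates])
    conf = [m for m, ll in zip(candidates, lls) if ll >= lls.max() - beta]
    # (2) Optimism: pick the (model, policy) pair with the largest value inside the set.
    _, pi_k = max(((value(m, pi), pi) for m in conf for pi in policies), key=lambda x: x[0])
    # (3) Execute the optimistic policy in the real environment and store the trajectory.
    traj, ret = rollout(true_model, pi_k)
    data.append(traj)
    print(f"episode {k:2d}: return {ret:.2f}, confidence set size {len(conf)}")
```

The intended intuition of the sketch: as trajectories accumulate, the likelihood-based confidence set concentrates around models consistent with the data, while planning optimistically within that set drives exploration toward informative policies.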