Paper Title
Learning in Multi-Player Stochastic Games
Paper Authors
Paper Abstract
We consider the problem of simultaneous learning in stochastic games with many players in the finite-horizon setting. While the typical target solution for a stochastic game is a Nash equilibrium, this is intractable with many players. We instead focus on variants of {\it correlated equilibria}, such as those studied for extensive-form games. We begin with a hardness result for the adversarial MDP problem: even for a horizon of 3, obtaining sublinear regret against the best non-stationary policy is \textsf{NP}-hard when both rewards and transitions are adversarial. This implies that convergence to even the weakest natural solution concept -- normal-form coarse correlated equilibrium -- is not possible via black-box reduction to a no-regret algorithm even in stochastic games with constant horizon (unless $\textsf{NP}\subseteq\textsf{BPP}$). Instead, we turn to a different target: algorithms which {\it generate} an equilibrium when they are used by all players. Our main result is an algorithm which generates an {\it extensive-form} correlated equilibrium, whose runtime is exponential in the horizon but polynomial in all other parameters. We give a similar algorithm which is polynomial in all parameters for "fast-mixing" stochastic games. We also show a method for efficiently reaching normal-form coarse correlated equilibria in "single-controller" stochastic games which follows the traditional no-regret approach. When shared randomness is available, the two generative algorithms can be extended to give simultaneous regret bounds and converge in the traditional sense.