Paper Title
Representation Learning for General-sum Low-rank Markov Games
Paper Authors
Paper Abstract
We study multi-agent general-sum Markov games with nonlinear function approximation. We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown nonlinear representation. The goal is to design an algorithm that (1) finds an $\varepsilon$-equilibrium policy sample-efficiently without prior knowledge of the environment or the representation, and (2) permits a deep-learning-friendly implementation. We leverage representation learning and present a model-based and a model-free approach to construct an effective representation from the collected data. For both approaches, the algorithm achieves a sample complexity of poly$(H,d,A,1/\varepsilon)$, where $H$ is the game horizon, $d$ is the dimension of the feature vector, $A$ is the size of the joint action space, and $\varepsilon$ is the optimality gap. When the number of players is large, the above sample complexity can scale exponentially with the number of players in the worst case. To address this challenge, we consider Markov games with a factorized transition structure and present an algorithm that escapes such exponential scaling. To the best of our knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates (nonlinear) function approximation. We accompany our theoretical results with a neural-network-based implementation of our algorithm and evaluate it against the widely used deep RL baseline, DQN with fictitious play.
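For concreteness, the low-rank assumption is commonly formalized as follows (a sketch using standard low-rank MDP notation; the symbols $\phi_h$ and $\mu_h$ are assumed here and may differ from the paper's exact definitions): at each step $h$, the transition kernel over joint actions $a$ factorizes as $\mathbb{P}_h(s' \mid s, a) = \langle \phi_h(s, a), \mu_h(s') \rangle$, where $\phi_h(s, a) \in \mathbb{R}^d$ is the unknown nonlinear feature of the current state and joint action and $\mu_h(\cdot)$ is an unknown $d$-dimensional embedding over next states. Under this view, representation learning amounts to estimating $\phi_h$ from collected data so that equilibrium computation can proceed as in the linear-function-approximation case.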