Paper Title

Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games

Paper Authors

Wenhao Zhan, Jason D. Lee, Zhuoran Yang

Paper Abstract

We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result \citep{liu2022learning}, we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, \underline{D}ecentralized \underline{O}ptimistic hype\underline{R}policy m\underline{I}rror de\underline{S}cent (DORIS), which achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a \textit{hyperpolicy} which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.
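To make the hyperpolicy update concrete, the following is a minimal sketch of a mirror-descent step over the policy space under the standard negative-entropy mirror map (i.e., an exponential-weights update); the learning rate $\eta$ and the notation $\widehat{V}_k^{\pi}$ for the optimistic value estimate are illustrative assumptions, not necessarily the paper's exact quantities:
\[
p_{k+1}(\pi) \;\propto\; p_k(\pi)\,\exp\!\bigl(\eta\,\widehat{V}_k^{\pi}\bigr), \qquad \pi \in \Pi,
\]
where $p_k$ denotes the hyperpolicy, a distribution over the policy space $\Pi$ maintained at episode $k$, and $\widehat{V}_k^{\pi}$ is an optimistic estimate of the value of policy $\pi$ against the opponents' revealed policies, obtained here via an optimistic variant of least-squares policy evaluation. With this choice of mirror map, mirror descent reduces to reweighting each policy by the exponential of its estimated value.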
