Paper Title

Provably Efficient Model-free RL in Leader-Follower MDP with Linear Function Approximation

Paper Authors

Ghosh, Arnob

Paper Abstract

We consider a multi-agent episodic MDP setup where an agent (the leader) takes an action at each step of the episode, followed by another agent (the follower). The state evolution and rewards depend on the joint action pair of the leader and the follower. Such interactions find applications in many domains such as smart grids, mechanism design, security, and policymaking. We are interested in how to learn policies for both players with a provable performance guarantee under a bandit-feedback setting. We focus on a setup where both the leader and the follower are {\em non-myopic}, i.e., both seek to maximize their rewards over the entire episode, and we consider a linear MDP, which can model the continuous state spaces common in many RL applications. We propose a {\em model-free} RL algorithm and show that a $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bound can be achieved for both the leader and the follower, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps, under the bandit-feedback information setup. Thus, our result holds even when the number of states becomes infinite. The algorithm relies on a {\em novel} adaptation of the LSVI-UCB algorithm. Specifically, we replace the standard greedy policy (as the best response) with a soft-max policy for both the leader and the follower. This turns out to be key in establishing a uniform concentration bound for the value functions. To the best of our knowledge, this is the first sub-linear regret guarantee for Markov games with non-myopic followers under function approximation.
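The central algorithmic change described in the abstract, replacing the greedy best response of LSVI-UCB with a soft-max policy over optimistic linear Q-estimates, can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' exact algorithm: the feature vector phi, ridge-regression weights w, inverse covariance Lambda_inv, bonus coefficient beta, and temperature eta are hypothetical placeholders for the quantities the paper defines.

```python
import numpy as np

def optimistic_q(phi, w, Lambda_inv, beta, H):
    """Optimistic linear Q-estimate for a single feature vector phi (dimension d):
    ridge-regression prediction plus an elliptical confidence bonus, clipped to [0, H]."""
    bonus = beta * np.sqrt(phi @ Lambda_inv @ phi)  # UCB-style exploration bonus
    return float(np.clip(phi @ w + bonus, 0.0, H))

def softmax_action(q_values, eta, rng=None):
    """Sample an action from a soft-max distribution over Q-estimates.

    This is the ingredient that replaces the greedy argmax of standard LSVI-UCB;
    eta is a (hypothetical) temperature: larger eta approaches the greedy policy."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    z = eta * (q - q.max())                    # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs)), probs

# Toy usage: the leader picks among 3 actions via a soft-max over optimistic
# Q-values; the follower, after observing the leader's action, would do the
# same over its own estimates instead of best-responding greedily.
d, H, beta, eta = 4, 10, 1.0, 5.0
w, Lambda_inv = np.zeros(d), np.eye(d)
phis = [np.eye(d)[a % d] for a in range(3)]    # toy feature vectors
q_vals = [optimistic_q(phi, w, Lambda_inv, beta, H) for phi in phis]
action, probs = softmax_action(q_vals, eta)
```

The soft-max makes the policy a smooth function of the Q-estimates, which is the property the abstract credits for enabling the uniform concentration argument over value functions.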
