Paper Title


Thompson sampling for linear quadratic mean-field teams

Paper Authors

Mukul Gagrani, Sagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar, Yi Ouyang

Paper Abstract


We consider optimal control of an unknown multi-agent linear quadratic (LQ) system where the dynamics and the cost are coupled across the agents through the mean-field (i.e., empirical mean) of the states and controls. Directly using single-agent LQ learning algorithms in such models results in regret which increases polynomially with the number of agents. We propose a new Thompson sampling based learning algorithm which exploits the structure of the system model and show that the expected Bayesian regret of our proposed algorithm for a system with agents of $|M|$ different types at time horizon $T$ is $\tilde{\mathcal{O}} \big( |M|^{1.5} \sqrt{T} \big)$ irrespective of the total number of agents, where the $\tilde{\mathcal{O}}$ notation hides logarithmic factors in $T$. We present detailed numerical experiments to illustrate the salient features of the proposed algorithm.
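To make the idea concrete, below is a minimal illustrative sketch of Thompson sampling for a single-agent scalar LQ system, not the paper's multi-agent mean-field algorithm: the learner keeps a Gaussian posterior over the unknown dynamics parameters $(a, b)$ of $x_{t+1} = a x_t + b u_t + w_t$, samples a model each step, plays the optimal linear gain for the sampled model, and updates the posterior from the observed transition. All function names, the prior, and the problem parameters here are assumptions chosen for demonstration.

```python
import numpy as np

def riccati_gain(a, b, q=1.0, r=1.0, iters=200):
    """Fixed-point iteration of the scalar discrete-time Riccati equation.

    Returns the gain k so the (certainty-equivalent) optimal control is
    u = -k * x for the sampled model (a, b) with stage cost q*x^2 + r*u^2.
    """
    p = q
    for _ in range(iters):
        k = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * k
    return k

def thompson_sampling_lq(T=200, a_true=0.9, b_true=0.5, noise=0.1, seed=0):
    """Run Thompson sampling on a scalar LQ system for T steps.

    Keeps a Gaussian posterior N(mu, Sigma) over theta = (a, b) and
    updates it by Bayesian linear regression after each transition.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)          # prior mean over (a, b)
    Sigma = np.eye(2)         # prior covariance
    x = 0.0
    total_cost = 0.0
    for _ in range(T):
        # Sample a model from the posterior; guard against degenerate b.
        a_s, b_s = rng.multivariate_normal(mu, Sigma)
        if abs(b_s) < 1e-3:
            b_s = 1e-3
        # Act optimally for the sampled model.
        u = -riccati_gain(a_s, b_s) * x
        x_next = a_true * x + b_true * u + noise * rng.standard_normal()
        total_cost += x * x + u * u
        # Posterior update: observe x_next = theta . (x, u) + w.
        z = np.array([x, u])
        S_inv = np.linalg.inv(Sigma)
        Sigma = np.linalg.inv(S_inv + np.outer(z, z) / noise**2)
        mu = Sigma @ (S_inv @ mu + z * x_next / noise**2)
        x = x_next
    return mu, total_cost

mu, cost = thompson_sampling_lq()
```

The paper's contribution is precisely that naively running a learner like this per agent (or on the joint system) scales poorly with the number of agents, whereas exploiting the mean-field structure keeps the regret dependent only on the number of agent types $|M|$.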
