Paper Title


Impact of Representation Learning in Linear Bandits

Paper Authors

Jiaqi Yang, Wei Hu, Jason D. Lee, Simon S. Du

Paper Abstract


We study how representation learning can improve the efficiency of bandit problems. We study the setting where we play $T$ linear bandits with dimension $d$ concurrently, and these $T$ bandit tasks share a common $k (\ll d)$ dimensional linear representation. For the finite-action setting, we present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit. When $T$ is sufficiently large, our algorithm significantly outperforms the naive algorithm (playing $T$ bandits independently) that achieves $\widetilde{O}(T\sqrt{dN})$ regret. We also provide an $\Omega(T\sqrt{kN} + \sqrt{dkNT})$ regret lower bound, showing that our algorithm is minimax-optimal up to poly-logarithmic factors. Furthermore, we extend our algorithm to the infinite-action setting and obtain a corresponding regret bound which demonstrates the benefit of representation learning in certain regimes. We also present experiments on synthetic and real-world data to illustrate our theoretical findings and demonstrate the effectiveness of our proposed algorithms.
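To make the problem setting concrete, here is a minimal sketch (an illustration under assumed notation, not the paper's algorithm or code) of a synthetic instance in which $T$ linear bandit tasks of dimension $d$ share a common $k$-dimensional linear representation and rewards are linear with additive Gaussian noise; names such as `B`, `W`, and `play_round` are hypothetical.

```python
import numpy as np

# Hypothetical synthetic setup: T concurrent d-dimensional linear bandit
# tasks whose parameters all lie in a shared k-dimensional subspace.
rng = np.random.default_rng(0)

d, k, T = 100, 5, 50          # ambient dimension, shared rank (k << d), number of tasks
num_actions, N = 20, 1000     # actions offered per round, rounds per task
noise_std = 0.1

# Shared representation: a column-orthonormal d x k matrix B.
B, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Each task t has parameter theta_t = B @ w_t, so all theta_t share the
# same k-dimensional representation.
W = rng.standard_normal((k, T))
Theta = B @ W                 # d x T matrix of task parameters

def play_round(task: int) -> tuple[np.ndarray, float]:
    """Draw a finite action set for one task, pick the action that is best
    under the true parameter (an oracle choice, used only to illustrate the
    reward model), and return that action with its noisy reward."""
    actions = rng.standard_normal((num_actions, d))
    mean_rewards = actions @ Theta[:, task]
    best = int(np.argmax(mean_rewards))
    reward = mean_rewards[best] + noise_std * rng.standard_normal()
    return actions[best], reward

# Example: one round played in each of the T concurrent tasks.
observations = [play_round(t) for t in range(T)]
```

In this setting the naive approach learns each $d$-dimensional parameter separately, whereas exploiting the shared matrix `B` reduces per-task learning to $k \ll d$ dimensions, which is the source of the regret improvement stated in the abstract.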
