Paper Title


Impact of Representation Learning in Linear Bandits

Paper Authors

Jiaqi Yang, Wei Hu, Jason D. Lee, Simon S. Du

Paper Abstract


We study how representation learning can improve the efficiency of bandit problems. We study the setting where we play $T$ linear bandits with dimension $d$ concurrently, and these $T$ bandit tasks share a common $k (\ll d)$ dimensional linear representation. For the finite-action setting, we present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit. When $T$ is sufficiently large, our algorithm significantly outperforms the naive algorithm (playing $T$ bandits independently) that achieves $\widetilde{O}(T\sqrt{dN})$ regret. We also provide an $\Omega(T\sqrt{kN} + \sqrt{dkNT})$ regret lower bound, showing that our algorithm is minimax-optimal up to poly-logarithmic factors. Furthermore, we extend our algorithm to the infinite-action setting and obtain a corresponding regret bound which demonstrates the benefit of representation learning in certain regimes. We also present experiments on synthetic and real-world data to illustrate our theoretical findings and demonstrate the effectiveness of our proposed algorithms.
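To make the problem setting concrete, here is a minimal sketch (an illustration under assumed notation, not the paper's algorithm or code) of a synthetic instance in which $T$ linear bandit tasks of dimension $d$ share a common $k$-dimensional linear representation and rewards are linear with additive Gaussian noise; names such as `B`, `W`, and `play_round` are hypothetical.

```python
import numpy as np

# Hypothetical synthetic setup: T concurrent d-dimensional linear bandit
# tasks whose parameters all lie in a shared k-dimensional subspace.
rng = np.random.default_rng(0)

d, k, T = 100, 5, 50          # ambient dimension, shared rank (k << d), number of tasks
num_actions, N = 20, 1000     # actions offered per round, rounds per task
noise_std = 0.1

# Shared representation: a column-orthonormal d x k matrix B.
B, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Each task t has parameter theta_t = B @ w_t, so all theta_t share the
# same k-dimensional representation.
W = rng.standard_normal((k, T))
Theta = B @ W                 # d x T matrix of task parameters

def play_round(task: int) -> tuple[np.ndarray, float]:
    """Draw a finite action set for one task, pick the action that is best
    under the true parameter (an oracle choice, used only to illustrate the
    reward model), and return that action with its noisy reward."""
    actions = rng.standard_normal((num_actions, d))
    mean_rewards = actions @ Theta[:, task]
    best = int(np.argmax(mean_rewards))
    reward = mean_rewards[best] + noise_std * rng.standard_normal()
    return actions[best], reward

# Example: one round played in each of the T concurrent tasks.
observations = [play_round(t) for t in range(T)]
```

In this setting the naive approach learns each $d$-dimensional parameter separately, whereas exploiting the shared matrix `B` reduces per-task learning to $k \ll d$ dimensions, which is the source of the regret improvement stated in the abstract.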
