Paper Title
Experimental Design for Regret Minimization in Linear Bandits
Paper Authors
Paper Abstract
In this paper we propose a novel experimental-design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While the existing literature tends to focus on optimism-based algorithms, which have been shown to be suboptimal in many cases, our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite-time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regimes. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with a slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds.
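For context, the Gaussian width mentioned in the abstract is a standard quantity: for an action set $\mathcal{A} \subset \mathbb{R}^d$ (this notation is ours, not taken from the paper),

\[
w(\mathcal{A}) \;=\; \mathbb{E}_{\eta \sim \mathcal{N}(0, I_d)}\Big[\sup_{a \in \mathcal{A}} \langle \eta, a \rangle\Big],
\]

and for finite $\mathcal{A}$ one always has $w(\mathcal{A}) \le \sqrt{2\log|\mathcal{A}|}\,\max_{a \in \mathcal{A}} \|a\|_2$, so Gaussian-width bounds are never worse than the $\sqrt{\log|\mathcal{A}|}$ factor a union bound over actions yields, and can be much smaller for structured (e.g., combinatorial) action sets.

To make the "planning via experimental design" idea concrete, the following is a minimal sketch of a strategy in the same family: textbook-style phased elimination with a near G-optimal design computed by Frank-Wolfe. It is not the paper's algorithm; `theta_star`, the per-phase pull budget of 50, the confidence-width schedule, and the Frank-Wolfe iteration count are all hypothetical choices made for this demo.

```python
import numpy as np

# Illustrative sketch only: textbook-style phased elimination with a
# (near) G-optimal experimental design. NOT the paper's algorithm; the
# constants below are hypothetical simulation choices.

rng = np.random.default_rng(0)
d, K = 4, 30
actions = rng.normal(size=(K, d))            # finite action set in R^d
theta_star = rng.normal(size=d)              # unknown parameter (simulation only)

active = list(range(K))                      # indices not yet eliminated
for phase in range(5):
    A = actions[active]
    # Frank-Wolfe iterations toward a G-optimal design lam over the active
    # actions: exploration is planned by shrinking the worst-case variance
    # max_a ||a||^2_{M(lam)^{-1}}, rather than chosen optimistically.
    lam = np.full(len(active), 1.0 / len(active))
    for t in range(100):
        M = A.T @ (lam[:, None] * A) + 1e-6 * np.eye(d)
        lev = np.einsum('ij,jk,ik->i', A, np.linalg.inv(M), A)
        j = np.argmax(lev)                   # direction with largest variance
        step = 2.0 / (t + 2)
        lam = (1 - step) * lam
        lam[j] += step
    # Pull arms in proportion to the design and refit least squares.
    pulls = np.maximum(1, np.round(50 * lam)).astype(int)
    X = np.repeat(A, pulls, axis=0)
    y = X @ theta_star + rng.normal(scale=0.1, size=len(X))
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # Eliminate actions whose estimated gap exceeds this phase's width.
    est = A @ theta_hat
    width = 0.5 * 2.0 ** (-phase)            # hypothetical halving schedule
    active = [idx for idx, e in zip(active, est) if est.max() - e <= width]

print("surviving actions:", active)
```

The contrast with optimism this sketch is meant to show: the sampling distribution `lam` is planned so that uncertainty shrinks over all surviving actions at once, instead of repeatedly pulling whichever single arm currently looks optimistically best.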