Paper Title
A General Theory of the Stochastic Linear Bandit and Its Applications
Paper Authors
Paper Abstract
Recent growing adoption of experimentation in practice has led to a surge of attention to multiarmed bandits as a technique to reduce the opportunity cost of online experiments. In this setting, a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward (or minimize regret) over a horizon of length $T$. In this paper, we introduce a general analysis framework and a family of algorithms for the stochastic linear bandit problem that includes well-known algorithms such as the optimism-in-the-face-of-uncertainty linear bandit (OFUL) algorithm and Thompson sampling (TS) as special cases. Our analysis technique bridges several streams of prior literature and yields a number of new results. First, our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG), that reduces the overexploration problem in OFUL. SG utilizes the data to discard actions with relatively low uncertainty and then chooses greedily among the remaining actions. In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS. The second application of our general framework is (to the best of our knowledge) the first polylogarithmic (in $T$) regret bounds for OFUL and TS, under conditions similar to those of Goldenshluger and Zeevi (2013). Finally, we obtain regret bounds for the $k$-armed contextual MAB that are sharper by a factor of $\sqrt{k}$.
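The abstract only states that SG discards actions with relatively low uncertainty and then picks greedily among the rest. The following is a minimal sketch of one such selection step under standard linear-bandit conventions; the ridge-regression summary statistics, the uncertainty measure $\|x\|_{V^{-1}}$, the sieve threshold `tau`, and the function name `sieved_greedy_step` are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a sieved-greedy-style action selection for a linear bandit.
# Assumed conventions (not from the paper): ridge estimate theta_hat = V^{-1} b,
# uncertainty of action x measured by ||x||_{V^{-1}}, and a sieve that keeps
# actions whose uncertainty is at least tau times the maximum uncertainty.
import numpy as np

def sieved_greedy_step(actions, V, b, tau=0.5):
    """Select one action (a row index of `actions`).

    actions : (k, d) array of feature vectors for the available actions.
    V       : (d, d) regularized Gram matrix, lambda*I + sum_s x_s x_s^T.
    b       : (d,)   response vector, sum_s r_s x_s.
    tau     : sieve level in (0, 1]; actions with uncertainty below
              tau * (max uncertainty) are discarded (assumed rule).
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                                   # ridge estimate of reward parameter
    rewards = actions @ theta_hat                           # greedy reward estimates
    widths = np.sqrt(np.einsum("nd,de,ne->n", actions, V_inv, actions))  # ||x||_{V^{-1}}
    keep = widths >= tau * widths.max()                     # sieve out low-uncertainty actions
    kept_idx = np.flatnonzero(keep)
    return kept_idx[np.argmax(rewards[kept_idx])]           # greedy choice among survivors

# Tiny synthetic usage example.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, T = 5, 10, 200
    theta = rng.normal(size=d)
    V, b = np.eye(d), np.zeros(d)
    for _ in range(T):
        actions = rng.normal(size=(k, d))                   # fresh action set each round
        i = sieved_greedy_step(actions, V, b)
        r = actions[i] @ theta + rng.normal(scale=0.1)      # noisy reward
        V += np.outer(actions[i], actions[i])
        b += r * actions[i]
```

With `tau` close to 0 every action survives the sieve and the rule reduces to pure greedy; larger `tau` restricts the greedy choice to the more uncertain actions, which is one way to read the abstract's claim that SG tempers OFUL's overexploration while retaining exploration where it is still needed.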