Paper Title
Delay-Adaptive Learning in Generalized Linear Contextual Bandits
Paper Authors
Paper Abstract
In this paper, we consider online learning in generalized linear contextual bandits where rewards are not immediately observed. Instead, rewards become available to the decision-maker only after some delay, which is unknown and stochastic. We study the performance of two well-known algorithms adapted to this delayed setting: one based on upper confidence bounds, and the other based on Thompson sampling. We describe how these two algorithms should be modified to handle delays and give regret characterizations for both. Our results contribute to the broad landscape of the contextual bandits literature by establishing that both algorithms can be made robust to delays, thereby helping clarify and reaffirm the empirical success of these two algorithms, which are widely deployed in modern recommendation engines.
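The delayed-feedback setting the abstract describes can be illustrated with a minimal simulation: a LinUCB-style learner that buffers each reward until its random delay elapses, and only then folds it into the least-squares estimate. This is a hedged sketch of the general idea, not the paper's exact algorithm; all names, parameters, and the reward model below are illustrative assumptions.

```python
import numpy as np


def delayed_linucb(T=500, d=5, K=4, alpha=1.0, max_delay=10, seed=0):
    """Illustrative sketch: LinUCB with stochastically delayed rewards.

    A reward generated at round t arrives only after a random delay,
    and enters the regression update only once it has arrived.
    (Hypothetical setup for illustration; not the paper's algorithm.)
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)     # unknown true parameter

    A = np.eye(d)      # regularized Gram matrix of *arrived* contexts
    b = np.zeros(d)    # sum of x_s * r_s over *arrived* rewards
    pending = []       # buffered (arrival_round, context, reward) triples
    regret = 0.0

    for t in range(T):
        # Fold in rewards whose delay has now elapsed.
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, x, r in arrived:
            A += np.outer(x, x)
            b += r * x

        theta_hat = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)

        # Draw K unit-norm contexts and pick the arm with the largest UCB.
        contexts = rng.normal(size=(K, d))
        contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
        widths = np.sqrt(np.einsum("ki,ij,kj->k", contexts, A_inv, contexts))
        a = int(np.argmax(contexts @ theta_hat + alpha * widths))

        means = contexts @ theta_star
        regret += means.max() - means[a]

        # Reward is observed only after an unknown stochastic delay.
        reward = means[a] + 0.1 * rng.normal()
        delay = rng.integers(0, max_delay + 1)
        pending.append((t + 1 + delay, contexts[a], reward))

    return regret
```

The key adaptation is the `pending` buffer: the confidence set at round `t` is built only from rewards that have actually arrived, so the estimator lags behind by roughly the delay distribution, which is what the regret characterizations in the paper account for.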