论文标题
关于随机多军匪徒的处罚
On Penalization in Stochastic Multi-armed Bandits
论文作者
论文摘要
我们研究了随机多臂匪徒(MAB)问题的重要变体,该变体考虑了惩罚。我们需要在总奖励和公平水平之间取得平衡,而不是直接提高累积预期奖励。在本文中,我们在MAB中介绍了一些新的见解,并在惩罚框架中提出了问题,在该框架中,严格的惩罚可以很好地定义,并且可以进行更复杂的遗憾分析。在这样的框架下,我们提出了一种像UCB一样坚硬的算法,该算法具有许多优点,包括渐近公平,几乎是最佳的遗憾,奖励和公平之间的更好的权衡。依赖于差距和差距独立的遗憾界限均已建立。发出多种有见地的评论来说明我们的理论分析的健全性。许多实验结果证实了理论,并表明了我们方法的优越性,而不是其他现有方法。
We study an important variant of the stochastic multi-armed bandit (MAB) problem, which takes penalization into consideration. Instead of directly maximizing cumulative expected reward, we need to balance between the total reward and fairness level. In this paper, we present some new insights in MAB and formulate the problem in the penalization framework, where rigorous penalized regret can be well defined and more sophisticated regret analysis is possible. Under such a framework, we propose a hard-threshold UCB-like algorithm, which enjoys many merits including asymptotic fairness, nearly optimal regret, better tradeoff between reward and fairness. Both gap-dependent and gap-independent regret bounds have been established. Multiple insightful comments are given to illustrate the soundness of our theoretical analysis. Numerous experimental results corroborate the theory and show the superiority of our method over other existing methods.