Paper Title
A Regret bound for Non-stationary Multi-Armed Bandits with Fairness Constraints
Paper Authors
Paper Abstract
The multi-armed bandit framework is the most common platform for studying strategies for sequential decision-making problems. Recently, the notion of fairness has attracted a lot of attention in the machine learning community. One can impose the fairness condition that at any given point of time, even during the learning phase, a poorly performing candidate should not be preferred over a better candidate. This fairness constraint is known to be one of the most stringent and has been studied in the stochastic multi-armed bandit framework in a stationary setting, for which regret bounds have been established. The main aim of this paper is to study this problem in a non-stationary setting. We present a new algorithm called Fair Upper Confidence Bound with Exploration (Fair-UCBe) for solving a slowly varying stochastic $k$-armed bandit problem. With this, we present two results: (i) Fair-UCBe indeed satisfies the above-mentioned fairness condition, and (ii) it achieves a regret bound of $O\left(k^{\frac{3}{2}} T^{1 - \frac{\alpha}{2}} \sqrt{\log T}\right)$, for some suitable $\alpha \in (0, 1)$, where $T$ is the time horizon. To the best of our knowledge, this is the first fair algorithm with a sublinear regret bound applicable to non-stationary bandits. We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the variation in the environment tends to zero.
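Note: the abstract does not spell out the Fair-UCBe procedure itself, so the sketch below is a minimal, assumption-laden illustration of the kind of algorithm it describes, not the paper's actual method. It combines two standard ingredients consistent with the abstract: interval chaining in the spirit of Joseph et al.'s fair UCB (randomize uniformly among arms whose confidence intervals overlap, transitively, with the best arm's, so a provably worse arm is never preferred), and a sliding window over recent observations to track a slowly varying environment. The function names, window size, drift model, and exploration constant `c` are all illustrative choices. As a side note on the stated bound: since $\alpha \in (0, 1)$, the exponent $1 - \frac{\alpha}{2}$ lies in $(\frac{1}{2}, 1)$, which is why the regret is sublinear in $T$.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def chained_arms(lower, upper):
    """Arms transitively linked to the top-UCB arm via overlapping intervals.

    Randomizing uniformly within this set enforces the fairness condition:
    an arm whose whole interval lies below another arm's interval is
    excluded, so a provably worse arm is never preferred.
    """
    chained = {int(np.argmax(upper))}
    changed = True
    while changed:
        changed = False
        for a in range(len(upper)):
            if a not in chained and any(
                upper[a] >= lower[b] and upper[b] >= lower[a] for b in chained
            ):
                chained.add(a)
                changed = True
    return sorted(chained)


def fair_ucb_sliding_window(T=10_000, k=5, window=1_000, c=2.0):
    """Toy fair-UCB run on slowly drifting Bernoulli arms (hypothetical setup)."""
    history = []  # (arm, reward) pairs; only the last `window` are used
    total_reward = 0.0
    for t in range(1, T + 1):
        # Slowly varying environment: a small sinusoidal drift over the horizon.
        means_t = np.clip(0.5 + 0.3 * np.sin(2 * np.pi * t / T + np.arange(k)), 0, 1)

        # Sliding-window estimates so stale observations are forgotten.
        pulls = np.ones(k)  # one pseudo-pull per arm avoids division by zero
        sums = np.zeros(k)
        for arm, r in history[-window:]:
            pulls[arm] += 1
            sums[arm] += r
        est = sums / pulls
        width = np.sqrt(c * np.log(t + 1) / pulls)

        # Play uniformly at random among the statistically indistinguishable arms.
        arm = int(rng.choice(chained_arms(est - width, est + width)))
        reward = float(rng.random() < means_t[arm])
        history.append((arm, reward))
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    print(f"Total reward over horizon: {fair_ucb_sliding_window():.0f}")
```

The uniform choice within the chained set is the fairness mechanism: two arms are treated differently only after their confidence intervals separate, i.e., only once one arm is statistically shown to be better than the other.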