Paper Title
Near-Optimal Collaborative Learning in Bandits
Paper Authors
Paper Abstract
This paper introduces a general multi-agent bandit model in which each agent faces a finite set of arms and may communicate with other agents through a central controller in order to identify (in pure exploration) or play (in regret minimization) its optimal arm. The twist is that the optimal arm for each agent is the arm with the largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm across all agents. This makes communication between agents often necessary. This general setting makes it possible to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization (Shi et al., 2021). In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.
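As an illustrative formalization of the mixed-reward objective described above (the notation here is ours and need not match the paper's): writing $M$ for the number of agents, $\mu_{n,k}$ for the expected reward of arm $k$ for agent $n$, and $w_{m,n}$ for the weight that agent $m$ places on agent $n$ (assumed known to the controller), the mixed reward of arm $k$ for agent $m$ is
$$\mu^{\mathrm{mix}}_{m,k} \;=\; \sum_{n=1}^{M} w_{m,n}\,\mu_{n,k},$$
and agent $m$'s optimal arm is $k^{\star}_m = \arg\max_{k} \mu^{\mathrm{mix}}_{m,k}$. When $w_{m,n} = \mathbf{1}\{m = n\}$, each agent faces an independent bandit problem; for general weights the agents' objectives are coupled, which is why communication through the central controller is typically required.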