Paper Title

Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits

Paper Authors

Siwei Wang, Longbo Huang, John C. S. Lui

Paper Abstract

We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain, and the reward of pulling an arm depends on both the pulled arm and the current state of the corresponding Markov chain. In this paper, we propose Restless-UCB, a learning policy that follows the explore-then-commit framework. In Restless-UCB, we present a novel method to construct offline instances, which only requires $O(N)$ time complexity ($N$ is the number of arms) and is exponentially better than the complexity of existing learning policies. We also prove that Restless-UCB achieves a regret upper bound of $\tilde{O}((N+M^3)T^{2\over 3})$, where $M$ is the Markov chain state space size and $T$ is the time horizon. Compared to existing algorithms, our result eliminates the exponential factor (in $M,N$) in the regret upper bound, thanks to a novel exploitation of the sparsity of transitions in general restless bandit problems. As a result, our analysis technique can also be adopted to tighten the regret bounds of existing algorithms. Finally, we conduct experiments based on a real-world dataset to compare the Restless-UCB policy with state-of-the-art benchmarks. Our results show that Restless-UCB outperforms existing algorithms in regret and significantly reduces the running time.
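
To make the setting concrete, below is a minimal sketch of an explore-then-commit loop on a simulated restless bandit. It is illustrative only: the randomly generated chains, the $T^{2/3}$ exploration length, and the "commit to the arm with the highest estimated reward" rule are simplifying assumptions, not the paper's Restless-UCB, which instead constructs an offline instance from confidence bounds and invokes an offline oracle.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 3, 4, 30_000                     # arms, states per chain, horizon (toy values)

P = [rng.dirichlet(np.ones(M), size=M) for _ in range(N)]  # true transition matrices
R = rng.uniform(size=(N, M))                               # true mean reward per (arm, state)
state = rng.integers(M, size=N)                            # current state of every chain

def step_all():
    """Every chain evolves at each round (the 'restless' property)."""
    for i in range(N):
        state[i] = rng.choice(M, p=P[i][state[i]])

def pull(i):
    """Observe a noisy reward for arm i in its current state."""
    return R[i, state[i]] + rng.normal(scale=0.1), state[i]

# Exploration phase: play each arm in a long consecutive block.
explore_len = int(T ** (2 / 3)) // N
reward_sum, visits = np.zeros((N, M)), np.zeros((N, M))
for i in range(N):
    for _ in range(explore_len):
        r, s = pull(i)
        reward_sum[i, s] += r
        visits[i, s] += 1
        step_all()

# Commit phase: a naive stand-in for the paper's offline oracle.
est_reward = np.divide(reward_sum, np.maximum(visits, 1))
best_arm = int(np.argmax(est_reward.mean(axis=1)))  # crude score; not the paper's rule
total, rounds = 0.0, T - N * explore_len
for _ in range(rounds):
    r, _ = pull(best_arm)
    total += r
    step_all()

print(f"committed to arm {best_arm}, average commit-phase reward {total / rounds:.3f}")
```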
