论文标题
有限的建议对序列预测的不断遗憾
Constant regret for sequence prediction with limited advice
论文作者
论文摘要
我们研究了在有限的信息访问情况下,针对有限大小的有限型家族的最佳专家,对单个序列预测的累积遗憾最小化问题。我们假设在每一轮中,学习者可以使用最多P专家的凸组合进行预测进行预测,然后他们可以观察后验,最多是M专家的损失。我们假设损耗函数是范围内的,并且是Exp-Concave。在标准的多武器匪徒设置中,当允许学习者每回合只玩一个专家并仅观察其反馈时,已知的最佳后悔界限就是O($ \ sqrt $ kt)。我们表明,允许学习者每回合额外扮演一名专家,并观察一个额外的反馈可以大大提高后悔的保证。我们提供了一项策略,仅将P = 2个专家组合为预测和观察M $ \ ge 2 $ 2的专家的损失。它的随机遗憾(wrt。学习者策略的内部随机化)是O(k/m)日志(k $δ$ -1),概率为1- $δ$,即,如果(p $ \ ge $ 2和m $ 2和m $ \ ge $ 3),则独立于地平线t(“常数”或“快速率”遗憾)。我们证明,此速率是K中的最佳选择。我们的策略不需要对视野T的任何先验知识,也不需要置信参数$δ$。最后,我们表明,如果学习者每回合只观察到一个专家反馈,那么最糟糕的遗憾是“缓慢的速率” $ω$($ \ sqrt $ kt),这表明每轮至少对至少两名专家的同步观察是不断遗憾的。
We investigate the problem of cumulative regret minimization for individual sequence prediction with respect to the best expert in a finite family of size K under limited access to information. We assume that in each round, the learner can predict using a convex combination of at most p experts for prediction, then they can observe a posteriori the losses of at most m experts. We assume that the loss function is range-bounded and exp-concave. In the standard multi-armed bandits setting, when the learner is allowed to play only one expert per round and observe only its feedback, known optimal regret bounds are of the order O($\sqrt$ KT). We show that allowing the learner to play one additional expert per round and observe one additional feedback improves substantially the guarantees on regret. We provide a strategy combining only p = 2 experts per round for prediction and observing m $\ge$ 2 experts' losses. Its randomized regret (wrt. internal randomization of the learners' strategy) is of order O (K/m) log(K$δ$ --1) with probability 1 -- $δ$, i.e., is independent of the horizon T ("constant" or "fast rate" regret) if (p $\ge$ 2 and m $\ge$ 3). We prove that this rate is optimal up to a logarithmic factor in K. In the case p = m = 2, we provide an upper bound of order O(K 2 log(K$δ$ --1)), with probability 1 -- $δ$. Our strategies do not require any prior knowledge of the horizon T nor of the confidence parameter $δ$. Finally, we show that if the learner is constrained to observe only one expert feedback per round, the worst-case regret is the "slow rate" $Ω$($\sqrt$ KT), suggesting that synchronous observation of at least two experts per round is necessary to have a constant regret.