有限的建议对序列预测的不断遗憾

论文标题

有限的建议对序列预测的不断遗憾

Constant regret for sequence prediction with limited advice

论文作者

Saad, El Mehdi, Blanchard, G.

论文摘要

我们研究了在有限的信息访问情况下，针对有限大小的有限型家族的最佳专家，对单个序列预测的累积遗憾最小化问题。我们假设在每一轮中，学习者可以使用最多P专家的凸组合进行预测进行预测，然后他们可以观察后验，最多是M专家的损失。我们假设损耗函数是范围内的，并且是Exp-Concave。在标准的多武器匪徒设置中，当允许学习者每回合只玩一个专家并仅观察其反馈时，已知的最佳后悔界限就是O（$ \ sqrt $ kt）。我们表明，允许学习者每回合额外扮演一名专家，并观察一个额外的反馈可以大大提高后悔的保证。我们提供了一项策略，仅将P = 2个专家组合为预测和观察M $ \ ge 2 $ 2的专家的损失。它的随机遗憾（wrt。学习者策略的内部随机化）是O（k/m）日志（k $δ$ -1），概率为1- $δ$，即，如果（p $ \ ge $ 2和m $ 2和m $ \ ge $ 3），则独立于地平线t（“常数”或“快速率”遗憾）。我们证明，此速率是K中的最佳选择。我们的策略不需要对视野T的任何先验知识，也不需要置信参数$δ$。最后，我们表明，如果学习者每回合只观察到一个专家反馈，那么最糟糕的遗憾是“缓慢的速率” $ω$（$ \ sqrt $ kt），这表明每轮至少对至少两名专家的同步观察是不断遗憾的。

We investigate the problem of cumulative regret minimization for individual sequence prediction with respect to the best expert in a finite family of size K under limited access to information. We assume that in each round, the learner can predict using a convex combination of at most p experts for prediction, then they can observe a posteriori the losses of at most m experts. We assume that the loss function is range-bounded and exp-concave. In the standard multi-armed bandits setting, when the learner is allowed to play only one expert per round and observe only its feedback, known optimal regret bounds are of the order O($\sqrt$ KT). We show that allowing the learner to play one additional expert per round and observe one additional feedback improves substantially the guarantees on regret. We provide a strategy combining only p = 2 experts per round for prediction and observing m $\ge$ 2 experts' losses. Its randomized regret (wrt. internal randomization of the learners' strategy) is of order O (K/m) log(K$δ$ --1) with probability 1 -- $δ$, i.e., is independent of the horizon T ("constant" or "fast rate" regret) if (p $\ge$ 2 and m $\ge$ 3). We prove that this rate is optimal up to a logarithmic factor in K. In the case p = m = 2, we provide an upper bound of order O(K 2 log(K$δ$ --1)), with probability 1 -- $δ$. Our strategies do not require any prior knowledge of the horizon T nor of the confidence parameter $δ$. Finally, we show that if the learner is constrained to observe only one expert feedback per round, the worst-case regret is the "slow rate" $Ω$($\sqrt$ KT), suggesting that synchronous observation of at least two experts per round is necessary to have a constant regret.

下载PDF全文

下载文献需遵守相关版权规定

论文标题