Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations

Paper Title

Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations

Authors

Hongju Park, Mohamad Kazem Shirani Faradonbeh

Abstract

Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm consists of the inner product of an unknown parameter with the context vector of that arm. The classical bandit settings heavily rely on assuming that the contexts are fully observed, while study of the richer model of imperfectly observed contextual bandits is immature. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing the above efficiency of Greedy policies is also provided.
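
To make the Greedy idea in the abstract concrete, here is a minimal simulation sketch. It is not the authors' algorithm or experimental setup. The observation model (true context plus Gaussian noise), the standard normal prior on contexts, the ridge regression estimate, and every name and default value in `run_greedy` are assumptions chosen for illustration. At each step the policy plugs in its current parameter estimate and its context estimates, then picks the arm with the largest estimated inner product.

```python
import numpy as np


def run_greedy(horizon=2000, n_arms=5, dim=3, obs_noise=0.5,
               reward_noise=0.1, seed=0):
    """Simulate a Greedy policy when contexts are observed with noise.

    Assumed model (illustrative only): contexts x ~ N(0, I), observations
    y = x + obs_noise * N(0, I), reward = x . theta + reward_noise * N(0, 1).
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=dim)           # true parameter, hidden from the policy
    gram = np.eye(dim)                     # ridge-regularized Gram matrix
    moment = np.zeros(dim)                 # running sum of reward * context estimate
    shrink = 1.0 / (1.0 + obs_noise ** 2)  # posterior-mean factor under the N(0, I) prior
    regret = 0.0

    for _ in range(horizon):
        contexts = rng.normal(size=(n_arms, dim))  # true contexts, never seen
        observed = contexts + obs_noise * rng.normal(size=(n_arms, dim))
        context_hat = shrink * observed            # E[x | y] for each arm

        # Greedy step: act as if both estimates were the true values.
        theta_hat = np.linalg.solve(gram, moment)
        arm = int(np.argmax(context_hat @ theta_hat))

        reward = contexts[arm] @ theta + reward_noise * rng.normal()

        # Update the least-squares statistics with the estimated context.
        gram += np.outer(context_hat[arm], context_hat[arm])
        moment += reward * context_hat[arm]

        # Expected regret against an oracle that knows theta but sees
        # only the same noisy observations.
        oracle_scores = context_hat @ theta
        regret += oracle_scores.max() - oracle_scores[arm]

    theta_hat = np.linalg.solve(gram, moment)
    return theta_hat, theta, regret


if __name__ == "__main__":
    theta_hat, theta, regret = run_greedy()
    print("estimation error:", np.linalg.norm(theta_hat - theta))
    print("cumulative regret:", regret)
```

In this sketch, regret is measured against a benchmark that knows the parameter but still sees only noisy contexts, which is one reasonable reading of the setting and may differ from the paper's exact definition. Regressing rewards on the posterior-mean contexts is a design choice that works here because, under the assumed Gaussian model, the expected reward given the observation is linear in that posterior mean.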
