Paper Title

Meta-Learning Bandit Policies by Gradient Ascent

Paper Authors

Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, Craig Boutilier

Paper Abstract

Most bandit policies are designed to either minimize regret in any problem instance, making very few assumptions about the underlying environment, or in a Bayesian sense, assuming a prior distribution over environment parameters. The former are often too conservative in practical settings, while the latter require assumptions that are hard to verify in practice. We study bandit problems that fall between these two extremes, where the learning agent has access to sampled bandit instances from an unknown prior distribution $\mathcal{P}$ and aims to achieve high reward on average over the bandit instances drawn from $\mathcal{P}$. This setting is of particular importance because it lays foundations for meta-learning of bandit policies and reflects more realistic assumptions in many practical domains. We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients. This provides a broadly applicable framework that is easy to implement. We derive reward gradients that reflect the structure of bandit problems and policies, for both non-contextual and contextual settings, and propose a number of interesting policies that are both differentiable and low-regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction, learned biases, and the practicality of our approach on a range of problems.
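To make the training loop described in the abstract concrete, below is a minimal, illustrative sketch (not the authors' implementation) of meta-learning a bandit policy by gradient ascent: Bernoulli bandit instances are sampled from an assumed prior, a softmax policy with learned per-arm biases theta is run on each instance, and a REINFORCE gradient with baseline subtraction is used to update theta. The prior, the policy class, and the function names (sample_instance, run_episode) are assumptions made for illustration; the paper optimizes differentiable variants of adaptive, low-regret bandit policies rather than this fixed softmax.

```python
import numpy as np

# Illustrative sketch: gradient ascent on the mean reward of a parameterized
# bandit policy over instances drawn from an assumed prior. Not the paper's code.
rng = np.random.default_rng(0)
K, horizon = 5, 100                        # number of arms, episode length

def sample_instance():
    """Draw the arm means of a Bernoulli bandit from an assumed prior P."""
    return rng.uniform(0.2, 0.8, size=K)

def run_episode(theta, means):
    """Play one instance with a softmax policy over learned per-arm biases theta.
    Returns the episode return G and the REINFORCE score sum_t d log pi(a_t)/d theta."""
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    G, score = 0.0, np.zeros_like(theta)
    for _ in range(horizon):
        arm = rng.choice(K, p=probs)
        reward = float(rng.random() < means[arm])
        indicator = np.zeros(K)
        indicator[arm] = 1.0
        G += reward
        score += indicator - probs         # gradient of log softmax w.r.t. theta
    return G, score

theta = np.zeros(K)                        # policy parameters (per-arm biases)
lr, batch = 0.05, 32
for step in range(200):
    episodes = [run_episode(theta, sample_instance()) for _ in range(batch)]
    returns = np.array([G for G, _ in episodes])
    baseline = returns.mean()              # baseline subtraction lowers gradient variance
    grad = np.mean([(G - baseline) * s for G, s in episodes], axis=0)
    theta += lr * grad                     # gradient ascent on mean reward over P
```

The per-arm biases here play the role of the "learned biases" mentioned in the abstract, and the mean-return baseline illustrates why baseline subtraction matters for the variance of the policy-gradient estimate; a practical policy class would additionally condition on the rewards observed within each episode.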
