Paper Title


An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling

Authors

Qin Ding, Cho-Jui Hsieh, James Sharpnack

Abstract


We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandits, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and is memory inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant with respect to $t$, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $\tilde{O}(\sqrt{T})$ regret with total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.
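To make the high-level description above concrete, below is a minimal, illustrative sketch of the SGD-TS idea: maintain a parameter estimate with a single online SGD step per round, and explore via Thompson Sampling by acting greedily on a randomly perturbed estimate. The logistic link, step-size schedule, noise scale, and simulated environment are assumptions made for illustration and are not the paper's exact algorithm or analysis.

```python
# Illustrative sketch of the SGD-TS idea (assumed logistic GLM bandit).
import numpy as np

rng = np.random.default_rng(0)

d, K, T = 5, 10, 2000                          # feature dim, arms, rounds (assumed)
theta_star = rng.normal(size=d) / np.sqrt(d)   # hypothetical true parameter

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_hat = np.zeros(d)   # running SGD estimate
eta = 0.5                 # base SGD step size (assumed)
nu = 0.1                  # Thompson Sampling noise scale (assumed)

total_reward = 0.0
for t in range(1, T + 1):
    X = rng.normal(size=(K, d)) / np.sqrt(d)   # contexts for the K arms this round

    # Exploration: sample a perturbed parameter and act greedily on it.
    theta_tilde = theta_hat + nu * rng.normal(size=d)
    a = int(np.argmax(X @ theta_tilde))

    # Observe a Bernoulli reward from the (simulated) logistic model.
    r = float(rng.random() < sigmoid(X[a] @ theta_star))
    total_reward += r

    # Exploitation: single-step online SGD on the logistic loss of the chosen arm.
    # Gradient of -[r*log(mu) + (1-r)*log(1-mu)] w.r.t. theta is (mu - r) * x.
    mu = sigmoid(X[a] @ theta_hat)
    theta_hat -= (eta / np.sqrt(t)) * (mu - r) * X[a]

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```

Because each round touches only the chosen arm's context, both the per-round time and memory stay constant in $t$, which is the efficiency property the abstract emphasizes.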
