Paper Title

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment

Paper Authors

Zixian Yang, Xin Liu, Lei Ying

Paper Abstract

Multi-armed bandit (MAB) is a classic model for understanding the exploration-exploitation trade-off. The traditional MAB model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok and YouTube Shorts, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A where "A" stands for abandonment and the abandonment probability depends on the current recommended item and the user's past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not like the previous item. We prove that both ULCB and KL-ULCB achieve logarithmic regret, $O(\log K)$, where $K$ is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results confirm our theoretical analysis and show that the proposed algorithms have significantly lower regrets than the traditional UCB and KL-UCB, and Q-learning-based algorithms.
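
The paper defines ULCB and KL-ULCB precisely; the sketch below is only a minimal Python illustration of the qualitative behavior the abstract describes: optimistic (UCB-style) selection after the user liked the previous item, pessimistic (LCB-style) selection after a disliked item, and users abandoning the visit after disliked recommendations. The class name `ULCBSketch`, the confidence-bound form, the constant `c`, and the toy like/abandonment probabilities are assumptions for illustration, not the paper's actual algorithm or parameters.

```python
# Illustrative sketch of a ULCB-style policy for a MAB-with-abandonment setting.
# All modeling details here are assumptions based on the abstract, not the paper.
import math
import random


class ULCBSketch:
    def __init__(self, n_arms, c=2.0):
        self.n_arms = n_arms
        self.c = c                      # exploration constant (assumed)
        self.counts = [0] * n_arms      # pulls per arm
        self.rewards = [0.0] * n_arms   # cumulative "user liked it" rewards

    def select(self, t, liked_previous):
        # Pull each arm once before using confidence bounds.
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a
        indices = []
        for a in range(self.n_arms):
            mean = self.rewards[a] / self.counts[a]
            width = math.sqrt(self.c * math.log(t + 1) / self.counts[a])
            # Optimistic (upper bound) when the user liked the previous item,
            # pessimistic (lower bound) when they did not -- the qualitative
            # behavior the abstract attributes to ULCB.
            indices.append(mean + width if liked_previous else mean - width)
        return max(range(self.n_arms), key=lambda a: indices[a])

    def update(self, arm, liked):
        self.counts[arm] += 1
        self.rewards[arm] += 1.0 if liked else 0.0


def run_episode(policy, like_prob, abandon_prob, t, max_steps=50):
    """One user visit: the user may abandon after a disliked recommendation.
    The probabilities used here are toy values, not from the paper."""
    liked_previous = True
    total_reward = 0
    for _ in range(max_steps):
        arm = policy.select(t, liked_previous)
        liked = random.random() < like_prob[arm]
        policy.update(arm, liked)
        total_reward += int(liked)
        if not liked and random.random() < abandon_prob:
            break                       # user temporarily leaves the system
        liked_previous = liked
    return total_reward


if __name__ == "__main__":
    random.seed(0)
    like_prob = [0.3, 0.5, 0.8]         # toy per-item engagement probabilities
    policy = ULCBSketch(n_arms=len(like_prob))
    total = sum(run_episode(policy, like_prob, abandon_prob=0.5, t=k)
                for k in range(1, 2001))  # K = 2000 visits (episodes)
    print("total reward over all visits:", total)
```

In this sketch, switching from an upper to a lower confidence bound after a disliked item makes the policy avoid risky exploration precisely when another miss could end the visit, which is the engagement-aware trade-off the abstract highlights.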
