Paper Title
Bandit Learning with Delayed Impact of Actions
Paper Authors
Paper Abstract
We consider a stochastic multi-armed bandit (MAB) problem with delayed impact of actions. In our setting, actions taken in the past impact the arm rewards in the subsequent future. This delayed impact of actions is prevalent in the real world. For example, the capability of people in a certain social group to pay back a loan might depend on how frequently loan applications from that group have historically been approved. If banks keep rejecting loan applications from people in a disadvantaged group, this could create a feedback loop and further damage the chances of people in that group obtaining loans. In this paper, we formulate this delayed and long-term impact of actions within the context of multi-armed bandits. We generalize the bandit setting to encode the dependency of this "bias" on the action history during learning. The goal is to maximize the utility collected over time while taking into account the dynamics created by the delayed impact of historical actions. We propose an algorithm that achieves a regret of $\tilde{\mathcal{O}}(KT^{2/3})$ and show a matching regret lower bound of $\Omega(KT^{2/3})$, where $K$ is the number of arms and $T$ is the learning horizon. Our results complement the bandit literature by adding techniques to deal with actions that have long-term impacts, and they have implications for designing fair algorithms.
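To make the setting concrete, here is a minimal Python sketch of a delayed-impact bandit. The abstract does not specify the reward dynamics, so this sketch assumes a simple frequency-based model: each arm's mean reward is a baseline plus an `impact` term proportional to how often that arm has been pulled so far, echoing the loan-approval example. The explore-then-commit learner below is a generic stand-in, chosen only because its exploration budget of order $T^{2/3}$ mirrors the stated regret rate; it is not the paper's algorithm, and all names and parameters (`base_means`, `impact`, `n_explore`) are hypothetical.

```python
# Illustrative sketch only: an ASSUMED delayed-impact reward model paired
# with a generic explore-then-commit learner, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 10_000
base_means = np.array([0.3, 0.5, 0.4])  # assumed baseline mean rewards
impact = 0.2                            # assumed strength of delayed impact
pulls = np.zeros(K)                     # action history: pull counts per arm

def pull(arm, t):
    """Sample a reward whose mean drifts with the arm's historical pull frequency."""
    freq = pulls[arm] / max(t, 1)            # fraction of past rounds spent on this arm
    mean = base_means[arm] + impact * freq   # delayed impact of the action history
    return rng.normal(mean, 0.1)

# Exploration phase: round-robin over arms for about T^{2/3} rounds,
# mirroring the T^{2/3} scaling in the stated regret bound.
n_explore = int(T ** (2 / 3))
totals = np.zeros(K)
for t in range(n_explore):
    arm = t % K
    totals[arm] += pull(arm, t)
    pulls[arm] += 1

# Commit phase: play the empirically best arm for the remaining rounds.
best = int(np.argmax(totals / np.maximum(pulls, 1)))
reward = totals.sum()
for t in range(n_explore, T):
    reward += pull(best, t)
    pulls[best] += 1

print(f"committed to arm {best}, total reward ~ {reward:.1f}")
```

The sketch highlights why this setting is harder than a standard MAB: because the means drift with the pull history, short exploration misestimates an arm's long-run (impacted) reward, so the exploration budget must be large enough to let the dynamics settle, which is consistent with the $T^{2/3}$ rate rather than the usual $\sqrt{T}$.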