Paper Title
Odds-Ratio Thompson Sampling to Control for Time-Varying Effect
Paper Authors
Paper Abstract
Multi-armed bandit methods have been used for dynamic experiments, particularly in online services. Among these methods, Thompson sampling is widely used because it is simple yet shows desirable performance. Many Thompson sampling methods for binary rewards use a logistic model written in a specific parameterization. In this study, we reparameterize the logistic model with odds-ratio parameters, which shows that Thompson sampling can be applied to a subset of the parameters. Based on this finding, we propose a novel method, "Odds-Ratio Thompson Sampling", which is expected to be robust to time-varying effects. We describe the use of the proposed method in continuous experiments and discuss a desirable property of the method. In simulation studies, the novel method is robust to temporal background effects, while its loss of performance is only marginal when no such effect is present. Finally, using a dataset from a real service, we show that the novel method would gain greater rewards in a practical environment.
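To make the idea concrete, below is a minimal Python sketch, our illustration rather than the paper's implementation, of Thompson sampling over only the odds-ratio parameters. The assumptions are: binary rewards, arm 0 as the reference, data grouped into time blocks so the time-varying baseline is absorbed by stratification, per-block log odds ratios estimated by Woolf's method with a continuity correction, and blocks pooled by inverse variance into an approximate normal posterior. All function names and the choice of pooling estimator are hypothetical.

```python
# Illustrative sketch (not the authors' code): Thompson sampling on a
# subset of parameters, namely the log odds ratios of each arm vs. a
# reference arm. Stratifying by time block removes the block-level
# (time-varying) baseline from the sampled quantities.
import numpy as np

rng = np.random.default_rng(0)

def block_log_or(succ, fail, ref_succ, ref_fail):
    """Log odds ratio of an arm vs. the reference arm within one time
    block, with Woolf's variance; 0.5 is added to every cell so the
    estimate stays finite when a cell is empty."""
    a, b = succ + 0.5, fail + 0.5
    c, d = ref_succ + 0.5, ref_fail + 0.5
    est = np.log(a / b) - np.log(c / d)
    var = 1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d
    return est, var

def pooled_posterior(blocks, arm):
    """Inverse-variance pooling of the block-wise log odds ratios of
    `arm` vs. arm 0; returns (mean, variance) of an approximate
    normal posterior for the odds-ratio parameter."""
    ests, variances = [], []
    for blk in blocks:
        est, var = block_log_or(blk[arm][0], blk[arm][1],
                                blk[0][0], blk[0][1])
        ests.append(est)
        variances.append(var)
    w = 1.0 / np.asarray(variances)
    return float(np.sum(w * np.asarray(ests)) / np.sum(w)), float(1.0 / np.sum(w))

def choose_arm(blocks, n_arms):
    """Thompson step: arm 0's log odds ratio is 0 by definition; every
    other arm's is drawn from its approximate posterior, and the arm
    with the largest draw is played."""
    draws = [0.0]
    for arm in range(1, n_arms):
        mean, var = pooled_posterior(blocks, arm)
        draws.append(rng.normal(mean, np.sqrt(var)))
    return int(np.argmax(draws))

# Toy usage: two time blocks, each holding (successes, failures) per arm.
blocks = [
    {0: (30, 70), 1: (40, 60)},  # block 1: high baseline success rate
    {0: (10, 90), 1: (15, 85)},  # block 2: baseline dropped for all arms
]
print(choose_arm(blocks, n_arms=2))
```

Note that the baseline log-odds never enters the sampled quantities, so a shift in the background success rate between blocks, as in the toy data above, does not bias the comparison between arms; this is the property the abstract refers to as robustness to time-varying effects.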