Paper Title

Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales

Authors

Kato, Masahiro

Abstract

This study addresses the problem of off-policy evaluation (OPE) from dependent samples obtained via the bandit algorithm. The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the bandit algorithm. Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing methods for OPE do not take this issue into account and are based on the assumption that samples are i.i.d. In this study, we address this problem by constructing an estimator from a standardized martingale difference sequence. To standardize the sequence, we consider using evaluation data or sample splitting with a two-step estimation. This technique produces an estimator with asymptotic normality without restricting a class of behavior policies. In an experiment, the proposed estimator performs better than existing methods, which assume that the behavior policy converges to a time-invariant policy.
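To make the idea in the abstract concrete, the sketch below illustrates (in Python) the general style of construction it describes, not the paper's exact estimator: per-round importance-weighted reward terms from adaptively collected bandit data are standardized by a variance estimate that uses only past rounds (so the scaling is predictable with respect to the data-collection process), and a normal approximation then gives a confidence interval for the evaluation-policy value. All function and argument names (standardized_ipw_ci, rewards, behavior_prob, eval_prob) are hypothetical, and the constant fallback variance for the first rounds is an assumption made for illustration.

```python
# Illustrative sketch only: a confidence interval for the value of an
# evaluation policy from adaptively collected bandit data, built from
# importance-weighted terms standardized by a past-only variance estimate.
import numpy as np
from scipy import stats


def standardized_ipw_ci(rewards, behavior_prob, eval_prob, alpha=0.05):
    """Normal-approximation CI for the evaluation-policy value.

    rewards[t]       : reward observed at round t
    behavior_prob[t] : probability the (adaptive) behavior policy assigned
                       to the chosen action at round t
    eval_prob[t]     : probability the evaluation policy assigns to that action
    """
    rewards = np.asarray(rewards, dtype=float)
    w = np.asarray(eval_prob, dtype=float) / np.asarray(behavior_prob, dtype=float)
    terms = w * rewards  # each term has conditional mean equal to the policy value

    # Predictable (past-only) variance estimates; a constant fallback is used
    # for the first rounds where no history is available (assumption).
    var_hat = np.ones_like(terms)
    for t in range(2, len(terms)):
        var_hat[t] = max(np.var(terms[:t]), 1e-6)

    # Inverse-variance weighting of the (approximately) martingale-difference
    # terms; the weights depend only on past data, so the construction remains
    # valid when the behavior policy changes over time.
    weights = 1.0 / var_hat
    theta_hat = np.sum(weights * terms) / np.sum(weights)
    se = 1.0 / np.sqrt(np.sum(weights))
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    return theta_hat, (theta_hat - z * se, theta_hat + z * se)
```

The key design point this sketch tries to convey is that each standardizing quantity is computed from earlier observations only, which is what allows a central-limit-type normal approximation to hold even though the samples are not i.i.d.; the paper's actual estimator achieves this with evaluation data or sample splitting with a two-step estimation, as stated in the abstract.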
