Paper Title
Off-Policy Interval Estimation with Lipschitz Value Iteration
Paper Authors
Paper Abstract
Off-policy evaluation provides an essential tool for evaluating the effects of different policies or treatments using only observed data. When applied to high-stakes scenarios such as medical diagnosis or financial decision-making, it is crucial to provide provably correct upper and lower bounds of the expected reward, not just a classical single point estimate, to the end-users, as executing a poor policy can be very costly. In this work, we propose a provably correct method for obtaining interval bounds for off-policy evaluation in a general continuous setting. The idea is to search for the maximum and minimum values of the expected reward among all the Lipschitz Q-functions that are consistent with the observations, which amounts to solving a constrained optimization problem on a Lipschitz function space. We go on to introduce a Lipschitz value iteration method to monotonically tighten the interval, which is simple yet efficient and provably convergent. We demonstrate the practical efficiency of our method on a range of benchmarks.
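The interval idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: transitions and the target policy are deterministic, so each observation gives an exact Bellman constraint Q(x_i) = r_i + gamma * Q(x'_i); the Lipschitz constant `eta` and the reward range `[r_min, r_max]` are known; and the metric is Euclidean. The names `pairwise_dist` and `lipschitz_value_iteration` are hypothetical. Each pass pushes the current bounds through the observed Bellman constraints and extends them to other points with the Lipschitz inequality. The running minimum and maximum are a choice made here so that the interval never widens.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a (m, d) and rows of b (n, d)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def lipschitz_value_iteration(x, r, x_next, x_eval, eta, gamma,
                              r_min, r_max, n_iters=100):
    """Sketch of interval bounds on Q at the points in x_eval.

    x       : (n, d) observed state-action features
    r       : (n,)   observed rewards
    x_next  : (n, d) next state paired with the target policy's action
    x_eval  : (m, d) points where bounds are wanted (e.g. initial states)
    eta     : assumed Lipschitz constant of the true Q-function
    Returns (lower, upper), each of shape (m,).
    """
    d_next = pairwise_dist(x_next, x)  # distances used inside the iteration
    d_eval = pairwise_dist(x_eval, x)  # distances used for the final read-out

    # Trivial bounds implied by the reward range (assumption: rewards bounded).
    hi = r_max / (1.0 - gamma)
    lo = r_min / (1.0 - gamma)
    upper = np.full(len(x), hi)  # upper bounds on Q at each x_next[i]
    lower = np.full(len(x), lo)  # lower bounds on Q at each x_next[i]

    for _ in range(n_iters):
        # Bellman backup at the data points gives bounds on Q(x_j).
        q_up = r + gamma * upper
        q_lo = r + gamma * lower
        # Lipschitz extension: Q(y) <= Q(x_j) + eta * d(y, x_j) for every j,
        # and symmetrically for the lower bound.
        upper = np.minimum(upper, (q_up[None, :] + eta * d_next).min(axis=1))
        lower = np.maximum(lower, (q_lo[None, :] - eta * d_next).max(axis=1))

    # Final read-out at the evaluation points.
    q_up = r + gamma * upper
    q_lo = r + gamma * lower
    upper_eval = np.minimum(hi, (q_up[None, :] + eta * d_eval).min(axis=1))
    lower_eval = np.maximum(lo, (q_lo[None, :] - eta * d_eval).max(axis=1))
    return lower_eval, upper_eval
```

Averaging `lower_eval` and `upper_eval` over samples from the initial-state distribution would give an interval for the expected reward. That interval is only as trustworthy as the assumed `eta`, since a constant that is too small can exclude the true Q-function.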