Paper Title
Offline Policy Evaluation and Optimization under Confounding
Paper Authors
Paper Abstract
Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on whether they are memoryless and on their effect on the data-collection policies. We characterize settings where consistent value estimates are provably not achievable, and provide algorithms with guarantees to instead estimate lower bounds on the value. When consistent estimates are achievable, we provide algorithms for value estimation with sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on both a gridworld environment and a simulated healthcare setting of managing sepsis patients. In gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, our methods significantly outperform confounder-oblivious benchmarks.