Paper Title

Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes

Paper Authors

Fu, Zuyue; Qi, Zhengling; Wang, Zhaoran; Yang, Zhuoran; Xu, Yanxun; Kosorok, Michael R.

Paper Abstract

We study offline reinforcement learning (RL) in the face of unmeasured confounders. Due to the lack of online interaction with the environment, offline RL faces the following two significant challenges: (i) the agent may be confounded by unobserved state variables; (ii) the offline data collected a priori do not provide sufficient coverage of the environment. To tackle these challenges, we study policy learning in confounded MDPs with the aid of instrumental variables. Specifically, we first establish value function (VF)-based and marginalized importance sampling (MIS)-based identification results for the expected total reward in confounded MDPs. Then, by leveraging pessimism and our identification results, we propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy under minimal data coverage and modeling assumptions. Lastly, our extensive theoretical investigations and a numerical study motivated by kidney transplantation demonstrate the promising performance of the proposed methods.
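As a rough illustration of the two identification routes named in the abstract (not the paper's exact confounded-MDP estimands, which additionally use instrumental variables to deal with the unobserved confounders), the expected total reward of a policy can be written either through a value function or through a marginalized importance-sampling ratio, and pessimism then optimizes a conservative estimate of it. All symbols below ($J$, $V^{\pi}$, $w^{\pi}$, $d^{\pi}$, $d^{\mathcal{D}}$, $\nu$, $\gamma$, $\Pi$, $\mathcal{C}$) are generic textbook notation assumed for this sketch, not notation taken from the paper:
\[
J(\pi) \;=\; \mathbb{E}_{s_0 \sim \nu}\big[ V^{\pi}(s_0) \big]
\;=\; \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\big[ w^{\pi}(s,a)\, r(s,a) \big],
\qquad
w^{\pi}(s,a) \;=\; \frac{d^{\pi}(s,a)}{d^{\mathcal{D}}(s,a)},
\]
where $\nu$ is the initial-state distribution and $d^{\pi}$, $d^{\mathcal{D}}$ are the discounted state-action occupancy measures of the target policy and of the behavior data. A pessimistic learner then returns
\[
\hat{\pi} \;\in\; \operatorname*{arg\,max}_{\pi \in \Pi} \; \min_{\hat{J} \in \mathcal{C}} \hat{J}(\pi),
\]
i.e., it maximizes the most conservative value estimate within a confidence set $\mathcal{C}$ consistent with the offline data, which is what allows suboptimality guarantees under weak data-coverage assumptions.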
