Paper Title
Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning
Paper Authors
Paper Abstract
We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $\pi$ using offline data collected by running a logging policy $\mu$. Standard importance-sampling-based approaches to this problem suffer from a variance that scales exponentially with the time horizon $H$, which has motivated a surge of recent interest in alternatives that break the "Curse of Horizon" (Liu et al., 2018; Xie et al., 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can achieve an estimation error of order $O(H^3/n)$ in mean squared error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. This MSE bound, however, is still a factor of $H$ away from a Cramér-Rao lower bound of order $\Omega(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramér-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.
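To make the MIS approach referenced in the abstract concrete, the sketch below implements a basic marginalized importance sampling estimator for tabular, episodic off-policy evaluation: it estimates the target policy's state marginals recursively from importance-reweighted transition counts and plugs them into a reward-weighted sum. This is only an illustration of the baseline MIS idea, not the modified estimator the paper proposes; the function name `mis_estimate`, the `(H, S, A)` array layout of `pi` and `mu`, the episode format, and the assumption that the logging policy's probabilities are known are all choices made for the example.

```python
import numpy as np

def mis_estimate(episodes, pi, mu, S, H):
    """Minimal sketch of a baseline marginalized importance sampling estimator.

    episodes: list of n trajectories, each a length-H list of (s, a, r) tuples
    pi, mu:   arrays of shape (H, S, A) with target / logging action probabilities
    S, H:     number of states and the horizon
    """
    n = len(episodes)

    # Empirical state marginals under the logging policy, d^mu_h(s).
    d_mu = np.zeros((H, S))
    for traj in episodes:
        for h, (s, a, r) in enumerate(traj):
            d_mu[h, s] += 1.0 / n

    # State marginals under the target policy, built recursively via
    # d^pi_{h+1}(s') = sum_s P^pi_h(s' | s) d^pi_h(s); the initial
    # state distribution is shared by pi and mu.
    d_pi = np.zeros((H, S))
    d_pi[0] = d_mu[0]

    v_hat = 0.0
    for h in range(H):
        reward_sum = np.zeros(S)         # sum of rho * r over visits to s
        next_counts = np.zeros((S, S))   # next_counts[s_next, s]: sum of rho
        visit_counts = np.zeros(S)       # number of visits to s at step h
        for traj in episodes:
            s, a, r = traj[h]
            rho = pi[h, s, a] / mu[h, s, a]  # per-step action-probability ratio
            visit_counts[s] += 1.0
            reward_sum[s] += rho * r
            if h + 1 < H:
                s_next = traj[h + 1][0]
                next_counts[s_next, s] += rho

        # Plug-in estimates of the reward and transition under pi at step h.
        r_pi = np.divide(reward_sum, visit_counts,
                         out=np.zeros(S), where=visit_counts > 0)
        v_hat += d_pi[h] @ r_pi
        if h + 1 < H:
            P_pi = np.divide(next_counts, visit_counts,
                             out=np.zeros((S, S)), where=visit_counts > 0)
            d_pi[h + 1] = P_pi @ d_pi[h]
    return v_hat
```

The recursion over estimated transitions is what avoids the product of $H$ per-step importance weights used by standard trajectory-level importance sampling, which is the source of the exponential variance the abstract refers to.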