Paper Title

Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning

Authors

Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou

Abstract

Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \cite{liu18breaking} proposed an approach that avoids the \emph{curse of horizon} suffered by typical importance-sampling-based methods. While showing promising results, this approach is limited in practice as it requires data be drawn from the \emph{stationary distribution} of a \emph{known} behavior policy. In this work, we propose a novel approach that eliminates such limitations. In particular, we formulate the problem as solving for the fixed point of a certain operator. Using tools from Reproducing Kernel Hilbert Spaces (RKHSs), we develop a new estimator that computes importance ratios of stationary distributions, without knowledge of how the off-policy data are collected. We analyze its asymptotic consistency and finite-sample generalization. Experiments on benchmarks verify the effectiveness of our approach.
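To make the idea of stationary-distribution importance ratios concrete, here is a minimal sketch (not the paper's RKHS estimator): once ratios tau(s, a) = d_pi(s, a) / d_mu(s, a) between the target and behavior stationary distributions have been estimated, the target policy's average reward is a self-normalized weighted mean over single transitions, with no per-trajectory product of ratios (which is what causes the curse of horizon). The function name and the toy numbers below are illustrative assumptions.

```python
import numpy as np

def weighted_average_reward(rewards, ratios):
    """Self-normalized off-policy estimate of the average reward.

    rewards: per-transition rewards observed in the behavior data.
    ratios:  estimated stationary-distribution importance ratios
             tau(s, a) = d_pi(s, a) / d_mu(s, a) for each transition.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    return float(np.sum(ratios * rewards) / np.sum(ratios))

# Toy example (hypothetical numbers): the behavior data oversamples
# high-reward states relative to the target policy, so the ratios
# down-weight them and the estimate drops below the naive average.
rewards = [1.0, 1.0, 0.0, 0.0]
ratios = [0.5, 0.5, 2.0, 2.0]
print(weighted_average_reward(rewards, ratios))  # 0.2 (naive mean: 0.5)
```

Note that each transition is weighted individually: the estimator's variance does not grow with the horizon, in contrast to trajectory-level importance sampling, where the weight is a product of per-step ratios.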
