Paper Title

Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning

Authors

Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou

Abstract

Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \cite{liu18breaking} proposed an approach that avoids the \emph{curse of horizon} suffered by typical importance-sampling-based methods. While showing promising results, this approach is limited in practice as it requires data be drawn from the \emph{stationary distribution} of a \emph{known} behavior policy. In this work, we propose a novel approach that eliminates such limitations. In particular, we formulate the problem as solving for the fixed point of a certain operator. Using tools from Reproducing Kernel Hilbert Spaces (RKHSs), we develop a new estimator that computes importance ratios of stationary distributions, without knowledge of how the off-policy data are collected. We analyze its asymptotic consistency and finite-sample generalization. Experiments on benchmarks verify the effectiveness of our approach.
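To make the idea of stationary-distribution importance ratios concrete, here is a minimal sketch (not the paper's RKHS estimator): once ratios tau(s, a) = d_pi(s, a) / d_mu(s, a) between the target and behavior stationary distributions have been estimated, the target policy's average reward is a self-normalized weighted mean over single transitions, with no per-trajectory product of ratios (which is what causes the curse of horizon). The function name and the toy numbers below are illustrative assumptions.

```python
import numpy as np

def weighted_average_reward(rewards, ratios):
    """Self-normalized off-policy estimate of the average reward.

    rewards: per-transition rewards observed in the behavior data.
    ratios:  estimated stationary-distribution importance ratios
             tau(s, a) = d_pi(s, a) / d_mu(s, a) for each transition.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    return float(np.sum(ratios * rewards) / np.sum(ratios))

# Toy example (hypothetical numbers): the behavior data oversamples
# high-reward states relative to the target policy, so the ratios
# down-weight them and the estimate drops below the naive average.
rewards = [1.0, 1.0, 0.0, 0.0]
ratios = [0.5, 0.5, 2.0, 2.0]
print(weighted_average_reward(rewards, ratios))  # 0.2 (naive mean: 0.5)
```

Note that each transition is weighted individually: the estimator's variance does not grow with the horizon, in contrast to trajectory-level importance sampling, where the weight is a product of per-step ratios.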
