Paper Title

Minimax Value Interval for Off-Policy Evaluation and Policy Optimization

Paper Authors

Nan Jiang, Jiawei Huang

Paper Abstract

We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. Although these methods hold the promise of overcoming the exponential variance of traditional importance sampling, several key problems remain: (1) They require function approximation and are generally biased. For the sake of trustworthy OPE, is there any way to quantify the biases? (2) They are split into two styles ("weight-learning" vs. "value-learning"). Can we unify them? In this paper we answer both questions positively. By slightly altering the derivation of previous methods (one from each style; Uehara et al., 2020), we unify them into a single value interval that comes with a special type of double robustness: when either the value-function or the importance-weight class is well specified, the interval is valid, and its length quantifies the misspecification of the other class. Our interval also provides a unified view of, and new insights into, some recent methods, and we further explore the implications of our results for exploration and exploitation in off-policy policy optimization with insufficient data coverage.
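To make the abstract's construction concrete, the following is a rough sketch of the Lagrangian that minimax OPE methods of this kind are built around and the type of value interval it yields. The symbols $\mathcal{Q}$ (value-function class), $\mathcal{W}$ (importance-weight class), $d_0$ (initial-state distribution), $\mu$ (data distribution), and the normalization conventions are assumptions made for this sketch; the paper's exact definitions and interval construction may differ in details.

\[
L(q, w) \;=\; \mathbb{E}_{s_0 \sim d_0}\!\left[q(s_0, \pi)\right]
\;+\; \mathbb{E}_{(s,a,r,s') \sim \mu}\!\left[w(s,a)\,\big(r + \gamma\, q(s', \pi) - q(s, a)\big)\right],
\qquad q(s,\pi) := \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[q(s,a)\right].
\]

If the value-function class is well specified ($q^\pi \in \mathcal{Q}$), the Bellman equation makes the second term vanish at $q = q^\pi$, so $L(q^\pi, w) = J(\pi)$ for every $w$, and therefore

\[
\min_{q \in \mathcal{Q}} \max_{w \in \mathcal{W}} L(q, w) \;\le\; J(\pi) \;\le\; \max_{q \in \mathcal{Q}} \min_{w \in \mathcal{W}} L(q, w).
\]

A symmetric telescoping argument shows that when $\mathcal{W}$ contains the true marginalized importance weight $w^\pi$ (the ratio of the discounted state-action occupancy of $\pi$ to $\mu$, under a suitable normalization), $L(q, w^\pi) = J(\pi)$ for every $q$, so the analogous min/max values with the roles of the two classes exchanged also bracket $J(\pi)$. Combining the two observations into a single interval is the doubly robust property described in the abstract: whichever class is correct, the true value is bracketed, and the interval length reflects the misspecification of the other class. For policy optimization with insufficient data coverage, one natural exploitation rule suggested by such an interval is to act pessimistically and maximize its lower endpoint over candidate policies.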
