Paper Title

A Practical Guide of Off-Policy Evaluation for Bandit Problems

Authors

Masahiro Kato, Kenshi Abe, Kaito Ariu, Shota Yasui

Abstract

Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies. Recently, applying OPE methods for bandit problems has garnered attention. For the theoretical guarantees of an estimator of the policy value, the OPE methods require various conditions on the target policy and policy used for generating the samples. However, existing studies did not carefully discuss the practical situation where such conditions hold, and the gap between them remains. This paper aims to show new results for bridging the gap. Based on the properties of the evaluation policy, we categorize OPE situations. Then, among practical applications, we mainly discuss the best policy selection. For the situation, we propose a meta-algorithm based on existing OPE estimators. We investigate the proposed concepts using synthetic and open real-world datasets in experiments.
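
To make the problem statement concrete, here is a minimal sketch of one standard OPE estimator, inverse probability weighting (IPW), on a simulated context-free multi-armed bandit. It is an illustration only, not the meta-algorithm proposed in the paper. The function names (`ipw_value`, `simulate`), the three-armed setup, and all numeric values are assumptions made for this example. The estimator reweights each logged reward by the ratio of the target policy's action probability to the behavior policy's action probability, which requires the behavior policy to assign positive probability to every action the target policy can take.

```python
import random

def ipw_value(samples, target_policy):
    """Inverse probability weighting estimate of the target policy's value.

    samples: list of (action, reward, behavior_prob) tuples, where
             behavior_prob is the probability with which the logging
             (behavior) policy chose that action.
    target_policy: list of action probabilities under the policy to evaluate.
    """
    total = 0.0
    for action, reward, behavior_prob in samples:
        # Importance weight: target probability over behavior probability.
        total += target_policy[action] / behavior_prob * reward
    return total / len(samples)

def simulate(n, behavior_policy, mean_rewards, seed=0):
    """Log n rounds of a Bernoulli-reward bandit under the behavior policy.

    Hypothetical data generator used only for this illustration.
    """
    rng = random.Random(seed)
    actions = list(range(len(behavior_policy)))
    samples = []
    for _ in range(n):
        a = rng.choices(actions, weights=behavior_policy)[0]
        r = 1.0 if rng.random() < mean_rewards[a] else 0.0
        samples.append((a, r, behavior_policy[a]))
    return samples

# Assumed toy setup: three arms, fixed (non-adaptive) policies.
behavior_policy = [0.5, 0.3, 0.2]   # policy that generated the logs
target_policy = [0.1, 0.2, 0.7]     # policy whose value we want to estimate
mean_rewards = [0.2, 0.5, 0.8]      # unknown in practice; known here for checking

samples = simulate(50_000, behavior_policy, mean_rewards)
estimate = ipw_value(samples, target_policy)
true_value = sum(p * m for p, m in zip(target_policy, mean_rewards))
print(f"IPW estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

With a fixed behavior policy and known action probabilities, as assumed here, the IPW estimate is unbiased for the target policy's value. When the logs come from an adaptive bandit algorithm or the behavior probabilities must be estimated, those conditions no longer hold automatically, which is the kind of gap between theory and practice the abstract refers to.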
