Off-Policy Evaluation for Large Action Spaces via Embeddings

Paper Title

Off-Policy Evaluation for Large Action Spaces via Embeddings

Authors

Yuta Saito, Thorsten Joachims

Abstract

Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.
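
The contrast the abstract draws between inverse propensity score weighting and marginalized importance weights can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a discrete action embedding whose distribution given the action, `p_e_given_a`, is known and does not depend on the context; the function names, argument names, and array shapes are all hypothetical choices made for this illustration.

```python
import numpy as np


def ips_estimate(pi_e, pi_0, actions, rewards):
    """Inverse propensity score estimate of the target policy's value.

    pi_e, pi_0: (n, n_actions) action probabilities under the target and
    logging policies for each logged context (illustrative layout).
    actions, rewards: (n,) logged action indices and observed rewards.
    """
    idx = np.arange(len(actions))
    # Weight each logged reward by pi_e(a|x) / pi_0(a|x).
    weights = pi_e[idx, actions] / pi_0[idx, actions]
    return float(np.mean(weights * rewards))


def marginalized_estimate(pi_e, pi_0, p_e_given_a, embeddings, rewards):
    """Estimate using importance weights marginalized over actions.

    p_e_given_a: (n_actions, n_embeddings) distribution of a discrete
    embedding given the action (assumed known here, for illustration).
    embeddings: (n,) logged embedding indices.
    """
    # Marginal embedding distributions: p(e|x, pi) = sum_a pi(a|x) p(e|a).
    marg_e = pi_e @ p_e_given_a
    marg_0 = pi_0 @ p_e_given_a
    idx = np.arange(len(embeddings))
    # Weight by the ratio of embedding marginals instead of action probabilities.
    weights = marg_e[idx, embeddings] / marg_0[idx, embeddings]
    return float(np.mean(weights * rewards))
```

When many actions share an embedding, the ratio of embedding marginals is typically much smaller than the per-action ratio, which is the intuition behind the variance reduction the abstract describes.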
