Paper Title

Offline Policy Comparison with Confidence: Benchmarks and Baselines

Authors

Anurag Koul, Mariano Phielipp, Alan Fern

Abstract

Decision makers often wish to use offline historical data to compare sequential-action policies at various world states. Importantly, computational tools should produce confidence values for such offline policy comparison (OPC) to account for statistical variance and limited data coverage. Nevertheless, there is little work that directly evaluates the quality of confidence values for OPC. In this work, we address this issue by creating benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning. In addition, we present an empirical evaluation of the risk versus coverage trade-off for a class of model-based baselines. In particular, the baselines learn ensembles of dynamics models, which are used in various ways to produce simulations for answering queries with confidence values. While our results suggest advantages for certain baseline variations, there appears to be significant room for improvement in future work.
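To make the baseline idea in the abstract concrete, the sketch below illustrates one plausible reading of it: an ensemble of learned dynamics models simulates rollouts of the two queried policies from a query's start state, a majority vote over ensemble members answers "is policy A better than policy B?", and the fraction of agreeing members serves as the confidence value; thresholding that confidence gives the risk-versus-coverage trade-off the abstract refers to. This is a minimal, hypothetical Python sketch with toy linear dynamics, not the paper's OPCC benchmark code; all names (DynamicsModel, rollout_return, answer_query) are illustrative assumptions.

```python
"""Hypothetical sketch of an ensemble-based baseline for offline policy
comparison with confidence. Toy setting, not the paper's implementation."""
import numpy as np


class DynamicsModel:
    """Toy stochastic dynamics model: s' = A s + B a + noise, reward = -||s'||^2."""

    def __init__(self, rng, state_dim=4, action_dim=2):
        self.A = np.eye(state_dim) + 0.05 * rng.standard_normal((state_dim, state_dim))
        self.B = 0.1 * rng.standard_normal((state_dim, action_dim))
        self.noise_scale = 0.01
        self.rng = rng

    def step(self, state, action):
        next_state = self.A @ state + self.B @ action
        next_state += self.noise_scale * self.rng.standard_normal(state.shape)
        reward = -float(next_state @ next_state)
        return next_state, reward


def rollout_return(model, policy, start_state, horizon=20, gamma=0.99):
    """Simulate one rollout of `policy` in `model`; return the discounted return."""
    state, total, discount = start_state.copy(), 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = model.step(state, action)
        total += discount * reward
        discount *= gamma
    return total


def answer_query(ensemble, policy_a, policy_b, start_state, horizon=20):
    """Answer "does policy_a beat policy_b from start_state?" with a confidence
    value in [0.5, 1.0]: the fraction of ensemble members agreeing with the majority."""
    votes = [
        rollout_return(m, policy_a, start_state, horizon)
        > rollout_return(m, policy_b, start_state, horizon)
        for m in ensemble
    ]
    frac_a_better = float(np.mean(votes))
    answer = frac_a_better > 0.5
    confidence = max(frac_a_better, 1.0 - frac_a_better)
    return answer, confidence


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state_dim, action_dim = 4, 2

    # An ensemble of independently perturbed models stands in for dynamics
    # models trained on (bootstrapped) offline data.
    ensemble = [DynamicsModel(np.random.default_rng(seed)) for seed in range(5)]

    # Two hand-coded policies play the role of a policy comparison query's pair.
    policy_a = lambda s: -0.5 * s[:action_dim]   # mild damping controller
    policy_b = lambda s: np.zeros(action_dim)    # do-nothing controller

    start_state = rng.standard_normal(state_dim)
    answer, confidence = answer_query(ensemble, policy_a, policy_b, start_state)
    print(f"policy_a better than policy_b? {answer} (confidence {confidence:.2f})")

    # Risk-coverage trade-off: only answer queries whose confidence clears a
    # threshold; raising the threshold lowers coverage but should reduce risk.
    threshold = 0.8
    print("abstain" if confidence < threshold else "answer query")
```

Under this reading, sweeping the abstention threshold traces out a risk-coverage curve: the benchmark then evaluates how well such confidence values separate correctly answered queries from incorrect ones.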
