Paper Title

Reliable Off-policy Evaluation for Reinforcement Learning

Paper Authors

Jie Wang, Rui Gao, Hongyuan Zha

Paper Abstract

In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated from a different behavior policy, without execution of the target policy. Reinforcement learning in high-stakes environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns, or the inability to explore. Hence it is imperative to quantify the uncertainty of the off-policy estimate before deployment of the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged trajectories. Leveraging methodologies from distributionally robust optimization, we show that with proper selection of the size of the distributional uncertainty set, these estimates serve as confidence bounds with non-asymptotic and asymptotic guarantees under stochastic or adversarial environments. Our results are also generalized to batch reinforcement learning and are supported by empirical analysis.
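The abstract gives no formulas, so the following is a minimal, hypothetical Python sketch of the two ingredients it mentions: an importance-sampling off-policy estimate of the cumulative reward from logged trajectories, and a distributionally robust lower/optimistic estimate obtained by optimizing over an uncertainty set around the empirical distribution of weighted returns. It illustrates the general DRO idea using a KL-divergence ball and its standard dual reformulation; the KL choice, the function names, and the radius `eps` are assumptions for illustration, not the estimator or uncertainty set constructed in the paper.

```python
# Sketch: importance-sampling OPE plus a KL-ball DRO bound (illustrative, not the paper's method).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp


def is_weighted_returns(trajectories, target_pi, behavior_pi, gamma=0.99):
    """Per-trajectory importance-sampled discounted returns.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples
    target_pi, behavior_pi: functions (state, action) -> action probability
    """
    vals = []
    for traj in trajectories:
        w, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            w *= target_pi(s, a) / behavior_pi(s, a)  # cumulative importance weight
            ret += (gamma ** t) * r                   # discounted return
        vals.append(w * ret)
    return np.asarray(vals)


def kl_dro_lower_bound(x, eps):
    """Worst-case mean of x over distributions within KL radius eps of the empirical
    distribution, computed via the dual  sup_{a>0} -a*log E[exp(-x/a)] - a*eps."""
    n = len(x)

    def neg_dual(alpha):
        log_mean = logsumexp(-x / alpha) - np.log(n)  # log E[exp(-x/alpha)], numerically stable
        return alpha * log_mean + alpha * eps          # negative of the dual objective

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e6), method="bounded")
    return -res.fun


# Hypothetical usage with logged data (logged_trajs, pi_target, pi_behavior not defined here):
# returns        = is_weighted_returns(logged_trajs, pi_target, pi_behavior)
# point_estimate = returns.mean()                           # standard IS estimate
# robust_lower   = kl_dro_lower_bound(returns, eps=0.1)     # pessimistic (robust) estimate
# optimistic_up  = -kl_dro_lower_bound(-returns, eps=0.1)   # optimistic estimate
```

In this sketch, shrinking or enlarging `eps` tightens or widens the interval between the robust and optimistic estimates, mirroring the abstract's point that the size of the distributional uncertainty set must be chosen properly for the estimates to act as valid confidence bounds.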
