Paper Title
Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning
Paper Authors
Paper Abstract
Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by out-of-distribution (OOD) actions. Previous methods tackle this problem by penalizing the Q-values of OOD actions or by constraining the trained policy to stay close to the behavior policy. Nevertheless, such methods typically prevent the value function from generalizing beyond the offline data and lack a precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL quantifies uncertainty via the disagreement of bootstrapped Q-functions and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolation error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yield a provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on the D4RL benchmark show that PBRL performs better than state-of-the-art algorithms.
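The abstract describes the method only at a high level. Below is a minimal, illustrative Python (PyTorch) sketch, not the authors' released code, of the central idea: measure epistemic uncertainty as the disagreement (standard deviation) across an ensemble of bootstrapped Q-functions, then subtract a multiple of that uncertainty from the Bellman target, and apply a similar penalty to actions sampled outside the dataset (OOD sampling). The names `BootstrappedQEnsemble`, `pessimistic_target`, `ood_target`, and the coefficients `beta` / `beta_ood` are assumptions made for illustration and do not follow the paper's notation.

```python
# Minimal sketch of uncertainty-penalized (pessimistic) Q-targets via a
# bootstrapped Q-ensemble. Hyperparameter names and shapes are illustrative.
import torch
import torch.nn as nn


class BootstrappedQEnsemble(nn.Module):
    """An ensemble of K independent Q-networks Q_k(s, a)."""

    def __init__(self, state_dim, action_dim, num_ensemble=10, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(num_ensemble)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        # Stack member predictions: shape (K, batch, 1).
        return torch.stack([q(x) for q in self.members], dim=0)


def pessimistic_target(q_ensemble, next_state, next_action, reward, done,
                       gamma=0.99, beta=1.0):
    """Bellman target penalized by ensemble disagreement (epistemic uncertainty)."""
    with torch.no_grad():
        q_next = q_ensemble(next_state, next_action)      # (K, B, 1)
        uncertainty = q_next.std(dim=0)                   # (B, 1)
        q_pess = q_next.mean(dim=0) - beta * uncertainty  # penalized value
        return reward + gamma * (1.0 - done) * q_pess


def ood_target(q_ensemble, state, policy_action, beta_ood=2.0):
    """Pseudo-target for actions drawn from the current policy (treated as OOD):
    each member is regressed toward its own value minus a shared uncertainty
    penalty, discouraging overestimation outside the dataset."""
    with torch.no_grad():
        q_ood = q_ensemble(state, policy_action)                     # (K, B, 1)
        return q_ood - beta_ood * q_ood.std(dim=0, keepdim=True)     # (K, B, 1)
```

A toy usage, assuming 3-dimensional states and 2-dimensional actions:

```python
ensemble = BootstrappedQEnsemble(state_dim=3, action_dim=2)
s2, a2 = torch.randn(4, 3), torch.randn(4, 2)
r, d = torch.randn(4, 1), torch.zeros(4, 1)
y = pessimistic_target(ensemble, s2, a2, r, d)  # (4, 1) penalized Bellman targets
```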