Paper Title

Mildly Conservative Q-Learning for Offline Reinforcement Learning

Authors

Jiafei Lyu, Xiaoteng Ma, Xiu Li, Zongqing Lu

Abstract

Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders the performance improvement. This paper explores mild but enough conservatism for offline learning while not harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy and no erroneous overestimation will occur for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms baselines. Our code is publicly available at https://github.com/dmksjfl/MCQ.
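To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of the pseudo-Q-value idea: OOD actions (drawn from the current policy) are actively regressed toward the maximum Q value over actions sampled from a learned behavior model, rather than penalized outright. The names `q_net`, `target_q_net`, `policy`, `behavior_sample`, the mixing weight `lam`, and `n_support` are all illustrative assumptions, not the authors' implementation; see the linked repository for the official code.

```python
# A hedged sketch of the MCQ-style pseudo-target idea, under assumed
# interfaces: q_net(s, a) -> (B, 1), policy(s) -> (B, act_dim),
# behavior_sample(s, n) -> (B, n, act_dim) (e.g., a CVAE fitted to the
# logged dataset). Rewards r are assumed to have shape (B, 1).
import torch

def mcq_critic_loss(q_net, target_q_net, policy, behavior_sample,
                    s, a, r, s_next, gamma=0.99, lam=0.9, n_support=10):
    """One critic step mixing the in-distribution TD error (weight lam)
    with pseudo-target regression for OOD actions (weight 1 - lam)."""
    # Standard Bellman target on logged (s, a, r, s') transitions.
    with torch.no_grad():
        td_target = r + gamma * target_q_net(s_next, policy(s_next))
    td_loss = ((q_net(s, a) - td_target) ** 2).mean()

    # Pseudo target: the max Q over actions sampled from the behavior
    # model, so OOD values are pulled toward, and never trained above,
    # the best in-distribution value.
    batch, obs_dim = s.shape
    with torch.no_grad():
        support = behavior_sample(s, n_support)                   # (B, N, act_dim)
        s_rep = s.unsqueeze(1).expand(batch, n_support, obs_dim)  # (B, N, obs_dim)
        q_support = q_net(s_rep.reshape(batch * n_support, obs_dim),
                          support.reshape(batch * n_support, -1))
        pseudo_target = q_support.view(batch, n_support).max(dim=1).values

    # OOD actions come from the current policy; instead of penalizing
    # them, actively regress their Q values toward the pseudo target.
    a_ood = policy(s)
    ood_loss = ((q_net(s, a_ood).squeeze(-1) - pseudo_target) ** 2).mean()

    return lam * td_loss + (1.0 - lam) * ood_loss
```

Regressing OOD actions toward the in-distribution maximum, rather than assigning them a fixed penalty, is what keeps the conservatism "mild": OOD values stay bounded by what the behavior policy can achieve without being pushed arbitrarily low, which is the generalization benefit the abstract emphasizes.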
