Paper Title
Conservative Q-Learning for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
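To make the description of the Q-value regularizer concrete, below is a minimal sketch of a CQL-style loss for discrete actions, following the idea stated in the abstract (a standard Bellman error plus a term that keeps Q-values conservative). It assumes a simple DQN-style setup; the function name `cql_loss`, the batch layout, and the weight `alpha` are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of a conservative Q-learning loss for discrete actions,
# assuming a DQN-style setup in PyTorch; names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """Standard Bellman error plus a conservative Q-value regularizer."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken in the dataset.
    q_values = q_net(states)                                   # [B, num_actions]
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Standard TD target from a target network.
    with torch.no_grad():
        next_q = target_q_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q
    bellman_error = F.mse_loss(q_taken, td_target)

    # Conservative regularizer: push down a log-sum-exp over all actions
    # while pushing up Q-values of the dataset actions, so the learned
    # Q-function tends to under- rather than over-estimate unseen actions.
    conservative_term = (torch.logsumexp(q_values, dim=1) - q_taken).mean()

    return bellman_error + alpha * conservative_term
```

As the abstract notes, such a term can be layered on top of an existing deep Q-learning or actor-critic training loop; only the loss computation changes, with `alpha` trading off conservatism against fitting the Bellman target.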