Paper Title
Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling
Paper Authors
Paper Abstract
In offline reinforcement learning, weighted regression is a common method to ensure the learned policy stays close to the behavior policy and to prevent selecting out-of-sample actions. In this work, we show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training, which deviates from their initial motivation. To address this problem, we adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model. The key insight is that such decoupling avoids learning an explicitly parameterized policy model with a closed-form expression. Directly learning the behavior policy allows us to leverage existing advances in generative modeling, such as diffusion-based methods, to model diverse behaviors. As for action evaluation, we combine our method with an in-sample planning technique to further avoid selecting out-of-sample actions and increase computational efficiency. Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail.
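The decoupling described in the abstract can be illustrated with a minimal sketch: draw candidate actions from a trained generative behavior model (kept in-distribution by construction), score them with a separately learned critic, and resample one candidate in proportion to its score. This is only an illustrative reading of the abstract, not the paper's exact algorithm or API; the names `sample_behavior`, `q_value`, `num_candidates`, and `beta` are hypothetical placeholders.

```python
import numpy as np

def select_action(state, sample_behavior, q_value, num_candidates=32, beta=10.0, rng=None):
    """Pick an action by evaluating candidates drawn from a behavior model.

    Hypothetical interfaces (not from the paper):
      sample_behavior(state, n) -> array (n, action_dim): actions sampled from the
          generative behavior model (e.g. a diffusion-based sampler).
      q_value(state, actions)   -> array (n,): critic scores for each candidate.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Candidates come only from the behavior model, so they stay in-distribution.
    candidates = sample_behavior(state, num_candidates)

    # Evaluate candidates with the action-evaluation model (critic).
    scores = np.asarray(q_value(state, candidates), dtype=float)

    # Temperature-scaled, numerically stable softmax over critic scores.
    weights = np.exp(beta * (scores - scores.max()))
    weights /= weights.sum()

    # Resample one candidate in proportion to its weight.
    idx = rng.choice(num_candidates, p=weights)
    return candidates[idx]
```

In this sketch, `beta` controls the trade-off between the two components: `beta = 0` reduces to sampling directly from the behavior model (pure behavior cloning), while a large `beta` approaches greedy selection of the highest-scoring in-distribution candidate.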