Paper Title
Critic Regularized Regression
Paper Authors
Paper Abstract
Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces -- outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.
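The abstract describes CRR as learning a policy from a fixed dataset via a critic-regularized (critic-weighted) regression. Below is a minimal sketch of one such advantage-weighted regression update, assuming a PyTorch setup: the `GaussianPolicy` class, `crr_policy_loss` function, the critic interface `critic(obs, act)`, and hyperparameters (`beta`, `clip`, `n_samples`) are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a critic-weighted policy regression update in the spirit of CRR.
# Names, network sizes, and the exponentiated-advantage weighting are assumptions.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Small Gaussian policy producing a mean and log-std per action dimension."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * act_dim))

    def dist(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

    def log_prob(self, obs, act):
        return self.dist(obs).log_prob(act).sum(-1)


def crr_policy_loss(policy, critic, obs, act, n_samples=4, beta=1.0, clip=20.0):
    """Weighted behavioural cloning: log-likelihood of dataset actions,
    weighted by an exponentiated advantage estimated with the critic.
    `critic(obs, act)` is assumed to return a (batch,) tensor of Q-values."""
    with torch.no_grad():
        q_data = critic(obs, act)                        # Q(s, a) for dataset actions
        # Baseline: average Q over actions sampled from the current policy.
        sampled = policy.dist(obs).sample((n_samples,))  # (n_samples, batch, act_dim)
        q_pi = torch.stack([critic(obs, a) for a in sampled]).mean(0)
        advantage = q_data - q_pi
        weight = torch.exp(advantage / beta).clamp(max=clip)
    # Maximize weighted log-likelihood of the dataset actions.
    return -(weight * policy.log_prob(obs, act)).mean()
```

In this sketch, actions with an estimated advantage above the policy's own average are up-weighted in the regression, so the policy stays close to the data while preferentially imitating the better actions in the dataset.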