Paper Title
EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model
Paper Authors
Paper Abstract
Unsupervised reinforcement learning (URL) offers a promising paradigm for learning useful behaviors in a task-agnostic environment, without the guidance of extrinsic rewards, to facilitate fast adaptation to various downstream tasks. Previous works focused on pre-training in a model-free manner and lacked the study of transition dynamics modeling, which leaves large room for improving sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised Reinforcement Learning Framework with Multi-choice Dynamics model (EUCLID), which introduces a novel model-fused paradigm that jointly pre-trains the dynamics model and the unsupervised exploration policy in the pre-training phase, thus better leveraging environmental samples and improving sample efficiency in downstream tasks. However, constructing a generalizable model that captures the local dynamics under different behaviors remains a challenging problem. We introduce the multi-choice dynamics model, which covers the distinct local dynamics under different behaviors concurrently: it uses separate heads to learn the state transitions under different behaviors during unsupervised pre-training and selects the most appropriate head for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, essentially solving the state-based URLB benchmark and reaching a mean normalized score of 104.0$\pm$1.2$\%$ on downstream tasks within 100k fine-tuning steps, matching DDPG's performance at 2M interaction steps, i.e., with 20x more data.
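The multi-choice dynamics model described above can be pictured as a shared trunk with several prediction heads, where each head specializes in the transitions induced by one exploration behavior, and the downstream phase selects the head whose one-step predictions fit the new task's transitions best. The following is a minimal, untrained sketch of that structure under illustrative assumptions (linear layers, random weights, squared-error head selection); the names, sizes, and selection rule are our own simplifications, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiChoiceDynamicsModel:
    """Illustrative multi-head (multi-choice) dynamics model: a shared
    trunk followed by N prediction heads, one per exploration behavior.
    Training is omitted; weights are random placeholders."""

    def __init__(self, state_dim, action_dim, hidden_dim=32, n_heads=4):
        in_dim = state_dim + action_dim
        # Shared trunk: encodes (state, action) into a hidden feature.
        self.W_trunk = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        # One linear head per behavior, each predicting the next state.
        self.W_heads = rng.normal(0.0, 0.1, (n_heads, hidden_dim, state_dim))

    def predict(self, state, action, head):
        """One-step next-state prediction using the chosen head."""
        h = np.tanh(np.concatenate([state, action]) @ self.W_trunk)
        return h @ self.W_heads[head]

    def select_head(self, transitions):
        """Pick the head with the lowest mean squared one-step prediction
        error on a small batch of downstream (s, a, s_next) transitions."""
        errors = [
            np.mean([np.sum((self.predict(s, a, s2_head := k and s2 or s2) - s2) ** 2)
                     for s, a, s2 in transitions])
            if False else
            np.mean([np.sum((self.predict(s, a, k) - s2) ** 2)
                     for s, a, s2 in transitions])
            for k in range(len(self.W_heads))
        ]
        return int(np.argmin(errors))

# Hypothetical usage: collect a few downstream transitions, then pick a head.
model = MultiChoiceDynamicsModel(state_dim=3, action_dim=2)
transitions = [(rng.normal(size=3), rng.normal(size=2), rng.normal(size=3))
               for _ in range(5)]
best_head = model.select_head(transitions)
next_state = model.predict(np.zeros(3), np.zeros(2), best_head)
```

In the full method, each head would be trained on data gathered by its corresponding unsupervised exploration policy; only the cheap head-selection step shown here runs at the start of fine-tuning.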