Paper Title

SFP: State-free Priors for Exploration in Off-Policy Reinforcement Learning

Authors

Marco Bagatella, Sammy Christen, Otmar Hilliges

Abstract

Efficient exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data in order to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as correlation between actions and directedness. Therefore, we introduce state-free priors, which directly model temporal consistency in demonstrated trajectories, and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning by dynamically sampling actions from a probabilistic mixture of policy and action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse reward settings.
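The abstract's integration scheme, dynamically sampling actions from a probabilistic mixture of the policy and the action prior, can be pictured with a minimal sketch. This is only an illustrative assumption, not the authors' implementation: `policy`, `prior`, `recent_actions`, and the mixing probability `alpha` are hypothetical names standing in for the learned components described in the paper.

```python
import numpy as np

def sample_action(policy, prior, state, recent_actions, alpha, rng=None):
    """Draw an action from a probabilistic mixture of the task policy and a
    state-free action prior (illustrative sketch, not the authors' code).

    policy(state)          -> action sampled from the current RL policy
    prior(recent_actions)  -> action sampled from a prior conditioned only on
                              the recent action history (hence "state-free")
    alpha                  -> probability of querying the prior for exploration
    """
    rng = rng or np.random.default_rng()
    if rng.random() < alpha:
        # Exploration branch: the prior sees only past actions, so it can
        # inject temporally consistent, directed behavior even on tasks
        # different from those in the offline data.
        return prior(recent_actions)
    # Exploitation branch: fall back to the task policy.
    return policy(state)
```

In such a scheme, `alpha` would typically be annealed or adapted during training so that the prior dominates early exploration and the policy takes over as it improves; the specific schedule here is an assumption, not taken from the paper.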
