When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning

Paper title

When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning

Authors

Haoyi Niu, Shubham Sharma, Yiwen Qiu, Ming Li, Guyue Zhou, Jianming Hu, Xianyuan Zhan

Abstract

Learning effective reinforcement learning (RL) policies to solve real-world complex tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility to learn policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline data with sufficient state-action space coverage for training. This brings up a new question: is it possible to combine learning from limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.
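The abstract describes the dynamics-aware policy evaluation only in words, so the following is a minimal illustrative sketch of the general idea, not the paper's actual objective. It assumes a per-sample dynamics-gap estimate is already available for each simulated state-action pair; the function name `dynamics_aware_penalty`, the softmax normalization of the gaps, and the exact form of the penalty are all hypothetical choices made for illustration.

```python
import numpy as np

def dynamics_aware_penalty(q_sim, gap_sim, q_real):
    """Hypothetical value regularizer (illustrative, not the paper's formula).

    q_sim   : Q-values on simulated state-action pairs
    gap_sim : assumed non-negative dynamics-gap estimates for those pairs
    q_real  : Q-values on state-action pairs from the real offline dataset
    """
    q_sim = np.asarray(q_sim, dtype=float)
    gap_sim = np.asarray(gap_sim, dtype=float)
    q_real = np.asarray(q_real, dtype=float)

    # Turn gap estimates into weights that sum to 1 (softmax, shifted by the
    # maximum for numerical stability). Larger gap -> larger weight.
    shifted = gap_sim - gap_sim.max()
    weights = np.exp(shifted) / np.exp(shifted).sum()

    # Minimizing this term lowers Q on high-gap simulated pairs and
    # raises Q on real-data pairs.
    return float(np.sum(weights * q_sim) - q_real.mean())


# Example: the second simulated pair has a much larger dynamics gap,
# so its Q-value dominates the penalty.
penalty = dynamics_aware_penalty(q_sim=[0.0, 10.0], gap_sim=[0.0, 100.0], q_real=[2.0, 4.0])
```

In a full critic update, a term like this would presumably be added to an ordinary Bellman error with some trade-off coefficient, computed on a mix of simulated and real transitions. Both the coefficient and the mixing scheme are assumptions here rather than details given in the abstract.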
