Paper Title

MOPO: Model-based Offline Policy Optimization

Paper Authors

Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma

Paper Abstract

Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. The code is available at https://github.com/tianheyu927/mopo.
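
The core mechanism described in the abstract, penalizing the learned model's reward by an estimate of dynamics uncertainty, can be illustrated with a short sketch. The code below is a minimal illustration in Python, assuming an ensemble of learned Gaussian dynamics models and using the maximum predicted standard-deviation norm across the ensemble as the uncertainty heuristic; the names (`GaussianDynamicsMember`, `penalized_reward`) are hypothetical and not taken from the released code.

```python
# Minimal sketch of an uncertainty-penalized reward, r_tilde = r_hat - lam * u(s, a),
# assuming an ensemble of learned Gaussian dynamics models. Illustrative only.
import numpy as np


class GaussianDynamicsMember:
    """One ensemble member: predicts mean and std of [next_state, reward]."""

    def predict(self, state, action):
        # Placeholder: a trained network would produce these predictions.
        mean = np.zeros(state.shape[0] + 1)   # concatenated [next_state, reward]
        std = 0.1 * np.ones(state.shape[0] + 1)
        return mean, std


def penalized_reward(ensemble, state, action, lam=1.0):
    """Penalize the model reward by dynamics uncertainty.

    u(s, a) is taken here as the largest predicted standard-deviation norm
    across ensemble members, one possible uncertainty estimator.
    """
    means, stds = zip(*[m.predict(state, action) for m in ensemble])
    r_hat = float(np.mean([mu[-1] for mu in means]))     # ensemble-mean reward
    u = max(np.linalg.norm(s) for s in stds)             # max-std uncertainty
    return r_hat - lam * u


# Usage sketch: evaluate the penalized reward for one (state, action) pair.
ensemble = [GaussianDynamicsMember() for _ in range(5)]
state, action = np.zeros(11), np.zeros(3)
print(penalized_reward(ensemble, state, action, lam=1.0))
```

The penalty coefficient `lam` trades off the gain from leaving the support of the batch data against the risk of exploiting model error, matching the trade-off characterized in the paper's analysis.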
