Paper Title
Model-based Policy Optimization with Unsupervised Model Adaptation
Paper Authors
Paper Abstract
Model-based reinforcement learning methods learn a dynamics model from real data sampled from the environment and leverage it to generate simulated data for deriving an agent. However, the potential distribution mismatch between simulated data and real data can lead to degraded performance. Despite much effort devoted to reducing this distribution mismatch, existing methods fail to address it explicitly. In this paper, we investigate how to bridge the gap between real and simulated data caused by inaccurate model estimation, in order to achieve better policy optimization. We first derive a lower bound on the expected return, which naturally motivates a bound-maximization algorithm that aligns the simulated and real data distributions. To this end, we propose AMPO, a novel model-based reinforcement learning framework that introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between the feature distributions of real and simulated data. Instantiating our framework with the Wasserstein-1 distance yields a practical model-based approach. Empirically, our approach achieves state-of-the-art sample efficiency on a range of continuous control benchmark tasks.
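The abstract's central technical step is aligning the feature distributions of real and simulated data by minimizing an IPM instantiated as the Wasserstein-1 distance. The sketch below is not the authors' implementation; it only illustrates one standard way to realize this idea: a critic estimates the Wasserstein-1 distance via the Kantorovich-Rubinstein dual (with weight clipping as a crude Lipschitz constraint), and a feature extractor shared with the dynamics model is trained to minimize that estimate. The network sizes, optimizer settings, input dimension, clipping constant, and the name `feature_extractor` are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of unsupervised model adaptation via Wasserstein-1 feature alignment.
# Assumptions: transitions are encoded as 20-dim vectors; all shapes/hyperparameters are illustrative.
import torch
import torch.nn as nn

feature_dim = 64

# Critic f in the dual form: W1(P, Q) = sup_{||f||_L <= 1} E_P[f(x)] - E_Q[f(x)].
critic = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))

# Hypothetical feature extractor shared with the learned dynamics model.
feature_extractor = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, feature_dim))

critic_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
model_opt = torch.optim.RMSprop(feature_extractor.parameters(), lr=5e-5)

def adaptation_step(real_batch, sim_batch, clip=0.01):
    """One adversarial step: tighten the W1 estimate, then align the feature distributions."""
    # 1) Critic ascent on E[f(real)] - E[f(sim)], with weight clipping to keep f
    #    approximately 1-Lipschitz (features detached so only the critic updates).
    real_feat = feature_extractor(real_batch).detach()
    sim_feat = feature_extractor(sim_batch).detach()
    critic_loss = -(critic(real_feat).mean() - critic(sim_feat).mean())
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    for p in critic.parameters():
        p.data.clamp_(-clip, clip)

    # 2) Feature-extractor descent: minimize the estimated W1 so that simulated
    #    features match real ones (the "unsupervised model adaptation" term).
    w1_estimate = (critic(feature_extractor(real_batch)).mean()
                   - critic(feature_extractor(sim_batch)).mean())
    model_opt.zero_grad()
    w1_estimate.backward()
    model_opt.step()
    return w1_estimate.item()

if __name__ == "__main__":
    real = torch.randn(256, 20)        # stand-in for encoded real transitions
    sim = torch.randn(256, 20) + 0.5   # stand-in for encoded simulated transitions
    print(adaptation_step(real, sim))
```

In the full method this alignment term would be combined with the usual dynamics-model fitting loss and the policy-optimization loop; the snippet isolates only the distribution-alignment component described in the abstract.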