Paper Title

Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

Paper Authors

Sebastian Curi, Felix Berkenkamp, Andreas Krause

Abstract

Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for learning the model, they ignore this distinction when optimizing the policy, which leads to greedy and insufficient exploration. At the same time, there are no practical solvers for optimistic exploration algorithms. In this paper, we propose a practical optimistic exploration algorithm (H-UCRL). H-UCRL reparameterizes the set of plausible models and hallucinates control directly on the epistemic uncertainty. By augmenting the input space with the hallucinated inputs, H-UCRL can be solved using standard greedy planners. Furthermore, we analyze H-UCRL and construct a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian Process models. Based on this theoretical foundation, we show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.
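
The abstract describes how H-UCRL turns optimistic exploration into a standard planning problem by augmenting the input space with hallucinated inputs that act on the model's epistemic uncertainty. Below is a minimal sketch of that idea, not the authors' implementation: it assumes a calibrated model exposing a mean prediction and an epistemic standard deviation, and the names `mean_fn`, `epistemic_std_fn`, `beta`, and the random-shooting planner are illustrative assumptions.

```python
import numpy as np

def hallucinated_step(state, action, eta, mean_fn, epistemic_std_fn, beta=1.0):
    """One step of 'hallucinated' dynamics (sketch of the H-UCRL idea).

    The true action is augmented with a hallucinated input eta in [-1, 1]^d
    that the planner also optimizes. The next state is chosen optimistically
    inside the model's epistemic confidence interval:
        s' = mu(s, a) + beta * sigma_epistemic(s, a) * eta
    `mean_fn`, `epistemic_std_fn`, and `beta` are assumed interfaces.
    """
    mu = mean_fn(state, action)                # model's mean prediction
    sigma = epistemic_std_fn(state, action)    # epistemic std (not aleatoric noise)
    eta = np.clip(eta, -1.0, 1.0)              # hallucinated control is bounded
    return mu + beta * sigma * eta


def greedy_plan_with_hallucination(state, reward_fn, mean_fn, epistemic_std_fn,
                                   horizon, action_dim, state_dim,
                                   n_samples=512, beta=1.0, rng=None):
    """Random-shooting planner over the augmented input (action, eta).

    Once the input space is augmented with hallucinated inputs, any standard
    greedy planner can be used; random shooting keeps the sketch short.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate sequences of true actions and hallucinated inputs.
    actions = rng.uniform(-1, 1, size=(n_samples, horizon, action_dim))
    etas = rng.uniform(-1, 1, size=(n_samples, horizon, state_dim))

    best_return, best_first_action = -np.inf, None
    for a_seq, eta_seq in zip(actions, etas):
        s, total = state, 0.0
        for a, eta in zip(a_seq, eta_seq):
            total += reward_fn(s, a)
            s = hallucinated_step(s, a, eta, mean_fn, epistemic_std_fn, beta)
        if total > best_return:
            best_return, best_first_action = total, a_seq[0]
    # The hallucinated inputs only shape the optimistic value estimate;
    # the agent executes the real action in the environment.
    return best_first_action
```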
