Case-Based Inverse Reinforcement Learning Using Temporal Coherence

Paper Title

Case-Based Inverse Reinforcement Learning Using Temporal Coherence

Authors

Jonas Nüßlein, Steffen Illium, Robert Müller, Thomas Gabor, Claudia Linnhoff-Popien

Abstract

Providing expert trajectories for Imitation Learning is often expensive and time-consuming, so algorithms should require as little expert data as possible. In this paper we present an algorithm that imitates the expert's higher-level strategy rather than only its individual actions, which we hypothesize requires less expert data and makes training more stable. As a prior, we assume that the higher-level strategy is to reach an unknown target state area, which we hypothesize is a valid prior for many domains in Reinforcement Learning. Although the target state area is unknown, the expert has demonstrated how to reach it, so the agent tries to reach states similar to those the expert visited. Building on the idea of Temporal Coherence, our algorithm trains a neural network to predict whether two states are similar, in the sense that they may occur close together in time. During inference, the agent compares its current state with expert states stored in a Case Base. The results show that our approach can still learn a near-optimal policy in settings with very little expert data, where algorithms that imitate the expert at the action level no longer can.
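
The two ideas in the abstract, labeling state pairs by how close in time they occur and scoring the agent's state against expert states in a Case Base, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The window size, the pair-sampling scheme, the use of the maximum similarity as the score, and every function name are assumptions made for illustration. `stand_in_similarity` is a hand-written placeholder for the trained neural network.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]


def sample_temporal_pairs(
    trajectory: Sequence[State],
    window: int,
    n_pairs: int,
    rng: random.Random,
) -> List[Tuple[State, State, int]]:
    """Build (state_a, state_b, label) training pairs from one trajectory.

    The label is 1 if the two states occur at most `window` steps apart,
    else 0. The window and the 50/50 sampling are assumptions.
    """
    pairs = []
    last = len(trajectory) - 1
    for _ in range(n_pairs):
        i = rng.randrange(len(trajectory))
        if rng.random() < 0.5:
            # Draw a partner from the temporal neighbourhood of i.
            j = rng.randint(max(0, i - window), min(last, i + window))
        else:
            # Draw a partner from anywhere in the trajectory.
            j = rng.randrange(len(trajectory))
        # The label always follows the actual temporal distance.
        label = 1 if abs(i - j) <= window else 0
        pairs.append((trajectory[i], trajectory[j], label))
    return pairs


def stand_in_similarity(a: State, b: State) -> float:
    """Placeholder for the trained network: exp(-Euclidean distance)."""
    return math.exp(-math.dist(a, b))


def case_base_reward(
    state: State,
    case_base: Sequence[State],
    similarity: Callable[[State, State], float],
) -> float:
    """Score a state by its best match among the stored expert states.

    Taking the maximum is an assumption; other aggregations are possible.
    """
    return max(similarity(state, expert_state) for expert_state in case_base)


# Toy usage: a 1-D walk whose final states stand in for the expert's goal area.
expert_trajectory = [(float(t), 0.0) for t in range(10)]
training_pairs = sample_temporal_pairs(
    expert_trajectory, window=2, n_pairs=50, rng=random.Random(0)
)
case_base = expert_trajectory[-3:]
near_goal = case_base_reward((9.0, 0.0), case_base, stand_in_similarity)
far_from_goal = case_base_reward((0.0, 0.0), case_base, stand_in_similarity)
```

In this toy setup a state near the stored expert states scores higher than a distant one, so the score could serve as a reward signal. In the paper's setting, the similarity function would be the network trained on temporally labeled pairs.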
