Paper Title

Energy-Based Imitation Learning

Authors

Minghuan Liu, Tairan He, Minkai Xu, Weinan Zhang

Abstract

We tackle a common scenario in imitation learning (IL), where agents try to recover the optimal policy from expert demonstrations without further access to the expert or to environment reward signals. Apart from simple Behavior Cloning (BC), which adopts supervised learning and suffers from compounding errors, previous solutions such as inverse reinforcement learning (IRL) and recent generative adversarial methods involve a bi-level or alternating optimization that updates both the reward function and the policy, and therefore suffer from high computational cost and training instability. Inspired by recent progress in energy-based models (EBMs), in this paper we propose a simplified IL framework named Energy-Based Imitation Learning (EBIL). Instead of updating the reward and the policy iteratively, EBIL breaks out of the traditional IRL paradigm with a simple and flexible two-stage solution: first estimate the expert energy as a surrogate reward function through score matching, then use this reward to learn the policy with reinforcement learning algorithms. EBIL combines the ideas of EBMs and occupancy measure matching, and through theoretical analysis we show that EBIL and Maximum-Entropy IRL (MaxEnt IRL) approaches are two sides of the same coin, so EBIL can serve as an alternative to adversarial IRL methods. Extensive qualitative and quantitative experiments indicate that EBIL recovers meaningful and interpretable reward signals while achieving effective and comparable performance against existing algorithms on IL benchmarks.
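The two-stage recipe described in the abstract (estimate an expert energy by score matching, then run an off-the-shelf reinforcement learning algorithm on the negative energy as a fixed surrogate reward) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes the energy is a small MLP over concatenated state-action pairs trained with denoising score matching, and the names EnergyNet, dsm_loss, fit_expert_energy, and surrogate_reward are hypothetical.

```python
# Minimal sketch of a two-stage energy-based imitation pipeline
# (hypothetical names; not the paper's reference implementation).
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scalar energy E_theta(s, a) over concatenated state-action pairs."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, sa):
        return self.net(sa).squeeze(-1)

def dsm_loss(energy_net, expert_sa, sigma=0.1):
    """Denoising score matching: the model score -dE/dx at perturbed expert
    samples should match the score of the Gaussian perturbation kernel."""
    noise = torch.randn_like(expert_sa) * sigma
    perturbed = (expert_sa + noise).requires_grad_(True)
    energy = energy_net(perturbed).sum()
    model_score = -torch.autograd.grad(energy, perturbed, create_graph=True)[0]
    target_score = -noise / (sigma ** 2)  # score of N(expert_sa, sigma^2 I)
    return ((model_score - target_score) ** 2).sum(dim=-1).mean()

# Stage 1: fit the expert energy on demonstration (s, a) pairs.
def fit_expert_energy(expert_sa, obs_dim, act_dim, epochs=100, lr=1e-3):
    energy_net = EnergyNet(obs_dim, act_dim)
    opt = torch.optim.Adam(energy_net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = dsm_loss(energy_net, expert_sa)
        loss.backward()
        opt.step()
    return energy_net

# Stage 2: hand the fixed surrogate reward r(s, a) = -E_theta(s, a) to any
# off-the-shelf RL algorithm in place of the environment reward.
def surrogate_reward(energy_net, state, action):
    with torch.no_grad():
        sa = torch.cat([state, action], dim=-1)
        return -energy_net(sa)
```

Under this reading, the surrogate reward r(s, a) = -E_theta(s, a) is fixed after stage 1, which removes the inner reward-update loop of adversarial IRL that the abstract identifies as a source of computational cost and training instability.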
