Paper Title
Non-Adversarial Imitation Learning and its Connections to Adversarial Methods
Paper Authors
Paper Abstract
Many modern methods for imitation learning and inverse reinforcement learning, such as GAIL or AIRL, are based on an adversarial formulation. These methods apply GANs to match the expert's distribution over states and actions with the implicit state-action distribution induced by the agent's policy. However, by framing imitation learning as a saddle point problem, adversarial methods can suffer from unstable optimization, and convergence can only be shown for small policy updates. We address these problems by proposing a framework for non-adversarial imitation learning. The resulting algorithms are similar to their adversarial counterparts and, thus, provide insights for adversarial imitation learning methods. Most notably, we show that AIRL is an instance of our non-adversarial formulation, which enables us to greatly simplify its derivations and obtain stronger convergence guarantees. We also show that our non-adversarial formulation can be used to derive novel algorithms by presenting a method for offline imitation learning that is inspired by the recent ValueDice algorithm, but does not rely on small policy updates for convergence. In our simulated robot experiments, our offline method for non-adversarial imitation learning seems to perform best when using many updates for policy and discriminator at each iteration and outperforms behavioral cloning and ValueDice.
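For context, the saddle-point formulation referenced in the abstract corresponds, in the standard GAIL-style setup, to an objective of roughly the following form (a sketch using conventional notation; the symbols $\pi$, $\pi_E$, $D$, and the entropy weight $\lambda$ are generic and not taken from this paper, and the sign convention for $D$ varies between papers):

\[
\min_{\pi} \; \max_{D} \;\;
\mathbb{E}_{(s,a)\sim \pi_E}\!\big[\log D(s,a)\big]
\;+\;
\mathbb{E}_{(s,a)\sim \pi}\!\big[\log\big(1 - D(s,a)\big)\big]
\;-\; \lambda H(\pi),
\]

where the discriminator $D$ is trained to distinguish expert state-action pairs from those induced by the agent's policy $\pi$, the policy is updated to make the two distributions indistinguishable, and $H(\pi)$ is an optional entropy regularizer. The instability mentioned above stems from alternating the inner maximization and outer minimization of this saddle-point problem; the non-adversarial framework proposed in the paper is motivated by avoiding that structure.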