Paper Title
Wasserstein Distance guided Adversarial Imitation Learning with Reward Shape Exploration
Paper Authors
Paper Abstract
Generative adversarial imitation learning (GAIL) provides an adversarial learning framework for imitating an expert policy from demonstrations in high-dimensional continuous tasks. However, GAIL and almost all of its extensions design only a single logarithmic-form reward function within an adversarial training strategy based on the Jensen-Shannon (JS) divergence, regardless of the environment. A fixed logarithmic reward function may not suit all complex tasks, and the vanishing-gradient problem caused by the JS divergence can harm the adversarial learning process. In this paper, we propose a new algorithm named Wasserstein Distance guided Adversarial Imitation Learning (WDAIL) to improve the performance of imitation learning (IL). Our method makes three improvements: (a) introducing the Wasserstein distance to obtain a more appropriate measure in the adversarial training process, (b) using proximal policy optimization (PPO) in the reinforcement learning stage, which is simpler to implement and makes the algorithm more efficient, and (c) exploring different reward function shapes to suit different tasks and improve performance. The experimental results show that the learning procedure remains remarkably stable and achieves strong performance on complex continuous control tasks in MuJoCo.
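To make the change of measure concrete, the sketch below (an illustration under my own assumptions, not code from the paper) shows how a WGAN-style critic could stand in for GAIL's JS-based discriminator: the critic loss is the negated Kantorovich-Rubinstein dual estimated over expert and policy state-action pairs, and the raw critic score is used as one possible reward shape. The `Critic` architecture, the tensor names `expert_sa`/`policy_sa`, and the layer sizes are hypothetical; the paper itself explores several reward shapes beyond the raw score shown here.

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Maps a concatenated state-action pair to a scalar score f(s, a)."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, sa):
        return self.net(sa)


def critic_loss(critic, expert_sa, policy_sa):
    # Negated Kantorovich-Rubinstein dual: minimizing this maximizes
    # E_expert[f(s, a)] - E_policy[f(s, a)], an estimate of the Wasserstein
    # distance between expert and policy occupancy measures. A Lipschitz
    # constraint (e.g. gradient penalty or weight clipping) must be added
    # in practice for the estimate to be valid.
    return critic(policy_sa).mean() - critic(expert_sa).mean()


def reward(critic, sa):
    # One possible reward shape: the raw critic score fed to the RL stage
    # (PPO in WDAIL); other shapes transform this score, e.g. with a
    # sigmoid or exponential.
    with torch.no_grad():
        return critic(sa).squeeze(-1)
```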