Paper Title
Watch and Match: Supercharging Imitation with Regularized Optimal Transport
Paper Authors
Paper Abstract
Imitation learning holds tremendous promise in learning policies efficiently for complex decision making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where given a set of expert demonstrations, an agent alternately infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interactions for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal-transport-based trajectory matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark demonstrate an average of 7.8X faster imitation to reach 90% of expert performance compared to prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.
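The recipe the abstract describes has two ingredients: per-timestep rewards obtained by optimal-transport matching between the agent's trajectory and an expert demonstration, and a behavior-cloning term whose weight is adapted during training. Below is a minimal NumPy sketch of the trajectory-matching reward, assuming observations are already embedded as fixed-length vectors; the function names (`sinkhorn_plan`, `ot_rewards`, `combined_actor_loss`), the uniform-marginal entropic-OT setup, and the scalar weight `lam` are illustrative simplifications, not the authors' implementation.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.05, n_iters=100):
    """Entropy-regularized optimal transport (Sinkhorn iterations)
    between uniform marginals, given a (T, T') pairwise cost matrix."""
    T, Tp = cost.shape
    mu = np.full(T, 1.0 / T)        # uniform mass over agent timesteps
    nu = np.full(Tp, 1.0 / Tp)      # uniform mass over expert timesteps
    K = np.exp(-cost / eps)         # Gibbs kernel
    u = np.ones(T)
    v = np.ones(Tp)
    for _ in range(n_iters):        # alternating marginal projections
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan, shape (T, T')

def ot_rewards(agent_obs, expert_obs):
    """Per-timestep pseudo-reward: the negative transport cost each agent
    step pays to align with the expert trajectory."""
    # Pairwise Euclidean costs between (embedded) observations,
    # rescaled so the Gibbs kernel stays numerically stable.
    diff = agent_obs[:, None, :] - expert_obs[None, :, :]
    cost = np.linalg.norm(diff, axis=-1)
    cost = cost / (cost.max() + 1e-8)
    plan = sinkhorn_plan(cost)
    return -(plan * cost).sum(axis=1)    # one scalar reward per agent step

def combined_actor_loss(rl_loss, bc_loss, lam):
    """Adaptive regularization in its simplest form: lam trades off the
    reward-driven RL term against behavior cloning. How lam is adapted
    over training is specified in the paper, not here."""
    return (1.0 - lam) * rl_loss + lam * bc_loss

# Hypothetical usage: a 50-step agent rollout matched against a 60-step
# expert demo, both embedded as 32-dimensional feature vectors.
rng = np.random.default_rng(0)
rewards = ot_rewards(rng.normal(size=(50, 32)), rng.normal(size=(60, 32)))
print(rewards.shape)   # (50,)
```

In the full algorithm, these pseudo-rewards would drive the reinforcement-learning update, with the behavior-cloning term folded in via the adaptive weight; the sketch above only fixes the shape of that combination, not the adaptation rule itself.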