Paper Title

Enhanced Meta Reinforcement Learning using Demonstrations in Sparse Reward Environments

Authors

Rengarajan, Desik; Chaudhary, Sapana; Kim, Jaewon; Kalathil, Dileep; Shakkottai, Srinivas

Abstract

Meta reinforcement learning (Meta-RL) is an approach wherein the experience gained from solving a variety of tasks is distilled into a meta-policy. The meta-policy, when adapted over only a small number of steps (or just a single step), is able to perform near-optimally on a new, related task. However, a major challenge to adopting this approach to solve real-world problems is that they are often associated with sparse reward functions that only indicate whether a task is completed partially or fully. We consider the situation where some data, possibly generated by a sub-optimal agent, is available for each task. We then develop a class of algorithms entitled Enhanced Meta-RL using Demonstrations (EMRLD) that exploit this information, even if it is sub-optimal, to obtain guidance during training. We show how EMRLD jointly utilizes RL and supervised learning over the offline data to generate a meta-policy that demonstrates monotone performance improvements. We also develop a warm-started variant called EMRLD-WS that is particularly efficient for sub-optimal demonstration data. Finally, we show that our EMRLD algorithms significantly outperform existing approaches in a variety of sparse reward environments, including that of a mobile robot.
