Paper Title
Reinforcement Learning with Supervision from Noisy Demonstrations
Paper Authors
Paper Abstract
Reinforcement learning has achieved great success in various applications. Learning an effective policy for the agent, however, usually requires a huge amount of data collected by interacting with the environment, which can be computationally costly and time-consuming. To overcome this challenge, the Reinforcement Learning with Expert Demonstrations (RLED) framework was proposed to exploit supervision from expert demonstrations. Although RLED methods can reduce the number of learning iterations, they usually assume the demonstrations are perfect, and thus may be seriously misled by noisy demonstrations in real applications. In this paper, we propose a novel framework that adaptively learns the policy by jointly interacting with the environment and exploiting the expert demonstrations. Specifically, we form an instance from each step of a demonstration trajectory and define a joint loss function that simultaneously maximizes the expected reward and minimizes the difference between agent behaviors and demonstrations. Most importantly, by calculating the expected gain in the value function, we assign each instance a weight that estimates its potential utility, so the more helpful demonstrations are emphasized while noisy ones are filtered out. Experimental results in various environments with multiple popular reinforcement learning algorithms show that the proposed approach learns robustly from noisy demonstrations and achieves higher performance in fewer iterations.
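The sketch below illustrates the kind of joint objective the abstract describes: a standard policy-gradient term for expected reward plus a per-instance weighted imitation term on demonstration (state, action) pairs, where each demonstration step is weighted by its estimated gain over the current value estimate. This is a minimal illustrative sketch, not the paper's implementation; the names (policy, value_fn, q_fn, demo_batch) and the clipped value-gain weighting rule are assumptions, since the abstract does not give the exact formula.

# Illustrative sketch of the joint loss (assumed actor-critic setup, discrete actions).
import torch
import torch.nn.functional as F

def joint_loss(policy, value_fn, q_fn, rollout, demo_batch, imitation_coef=1.0):
    # RL term: maximize expected reward via an advantage-weighted
    # policy-gradient surrogate computed on the agent's own rollout.
    logp = policy.log_prob(rollout.states, rollout.actions)
    rl_loss = -(logp * rollout.advantages.detach()).mean()

    # Per-instance weights (assumed rule): estimated gain of the demonstrated
    # action over the current value estimate; steps with no expected gain
    # (likely noise) receive weight close to zero.
    with torch.no_grad():
        gain = q_fn(demo_batch.states, demo_batch.actions) - value_fn(demo_batch.states)
        weights = torch.clamp(gain, min=0.0)

    # Imitation term: minimize the difference between agent behavior and the
    # demonstrated actions, weighted per instance by the estimated utility.
    logits = policy.logits(demo_batch.states)                       # (batch, num_actions)
    per_step = F.cross_entropy(logits, demo_batch.actions, reduction="none")
    imitation_loss = (weights * per_step).mean()

    return rl_loss + imitation_coef * imitation_loss

In this reading, a clean demonstration step has high expected gain and therefore pulls the policy toward the demonstrated action, while a noisy step contributes little to the imitation term and the agent falls back on the reward signal alone.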