Paper Title

Policy Learning Using Weak Supervision

Paper Authors

Jingkang Wang, Hongyi Guo, Zhaowei Zhu, Yang Liu

Paper Abstract

Most existing policy learning solutions require the learning agents to receive high-quality supervision signals, such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is usually infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages the available cheap weak supervision to perform policy learning efficiently. To handle this problem, we treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreement). Our approach explicitly punishes a policy for overfitting to the weak supervision. In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high.
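
The core mechanism described in the abstract, scoring a policy by "correlated agreement" with a noisy peer rather than by simple agreement, can be illustrated with a short sketch. The snippet below is an illustrative assumption, not the authors' algorithm: the function name correlated_agreement_score, the list-based action inputs, and the discrete-action equality check are all hypothetical simplifications. It rewards agreement with the peer on matched samples and subtracts agreement on randomly re-paired samples, which is one way to realize the penalty for overfitting to weak supervision that the abstract mentions.

    import random

    def correlated_agreement_score(learner_actions, peer_actions):
        """Score a learner policy against weak (peer) supervision.

        Rewards agreement with the peer on matched samples, then subtracts
        agreement on randomly re-paired samples, so a policy gains little
        from blindly copying patterns in the noisy peer signal.
        """
        n = len(learner_actions)
        # Agreement rate on correctly matched learner/peer pairs.
        matched = sum(a == p for a, p in zip(learner_actions, peer_actions)) / n
        # Agreement rate against randomly shuffled (decorrelated) peer labels:
        # this term penalizes overfitting to the weak supervision.
        shuffled = random.sample(list(peer_actions), n)
        mismatched = sum(a == p for a, p in zip(learner_actions, shuffled)) / n
        return matched - mismatched

As a sanity check under this sketch, a degenerate learner that always outputs the peer's most frequent action has, in expectation, equal matched and mismatched agreement rates, so its score is close to zero rather than being rewarded for spurious agreement with the weak signal.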
