Paper Title

Reward Uncertainty for Exploration in Preference-based Reinforcement Learning

Authors

Xinran Liang, Katherine Shu, Kimin Lee, Pieter Abbeel

Abstract

Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL methods are able to learn a more flexible reward model from human preferences by actively incorporating human feedback, i.e., a teacher's preferences between two clips of behavior. However, poor feedback-efficiency remains a problem in current preference-based RL algorithms, as tailored human feedback is very expensive. To handle this issue, previous methods have mainly focused on improving query selection and policy initialization. At the same time, recent exploration methods have proven to be a recipe for improving sample-efficiency in RL. We present an exploration method designed specifically for preference-based RL algorithms. Our main idea is to design an intrinsic reward that measures novelty based on the learned reward. Specifically, we utilize disagreement across an ensemble of learned reward models. Our intuition is that disagreement among the learned reward models reflects uncertainty in the tailored human feedback and can be useful for exploration. Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms on complex robot manipulation tasks from MetaWorld benchmarks, compared with other existing exploration methods that measure the novelty of state visitation.
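
To make the core idea concrete, below is a minimal sketch of an ensemble-disagreement exploration bonus: several reward models are queried on the same state-action pair, and the standard deviation of their predictions serves as the intrinsic reward. The ensemble size, MLP architecture, bonus weight `beta`, and the additive combination with the mean learned reward are illustrative assumptions, not settings taken from the paper.

```python
# Sketch of an ensemble-disagreement exploration bonus for preference-based RL.
# Assumptions (illustrative, not from the paper): ensemble size, MLP architecture,
# and an additive combination r_total = mean(r_hat) + beta * std(r_hat).
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """One member of the learned reward ensemble: maps (state, action) -> scalar reward."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def exploration_bonus(ensemble, obs, act):
    """Intrinsic reward = disagreement (std) across the ensemble's reward predictions."""
    with torch.no_grad():
        preds = torch.stack([model(obs, act) for model in ensemble], dim=0)  # (N, batch)
    return preds.std(dim=0)  # high std -> human feedback is uncertain here -> worth exploring


def total_reward(ensemble, obs, act, beta: float = 0.05):
    """Mean learned reward plus a scaled exploration bonus (illustrative combination)."""
    with torch.no_grad():
        preds = torch.stack([model(obs, act) for model in ensemble], dim=0)
    return preds.mean(dim=0) + beta * preds.std(dim=0)


if __name__ == "__main__":
    obs_dim, act_dim, n_models = 39, 4, 3  # dimensions chosen only for the demo
    ensemble = [RewardModel(obs_dim, act_dim) for _ in range(n_models)]
    obs, act = torch.randn(8, obs_dim), torch.randn(8, act_dim)
    print(exploration_bonus(ensemble, obs, act))
    print(total_reward(ensemble, obs, act))
```

In this setting the ensemble members would typically be trained on the same teacher preferences (e.g., with a Bradley-Terry style preference loss) but with different initializations or minibatches, so their disagreement is large exactly where the collected feedback constrains the reward least. One would also likely anneal the bonus weight as more feedback accumulates; that schedule, like the rest of the sketch, is an illustrative choice rather than a detail given in the abstract.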
