Title


Towards customizable reinforcement learning agents: Enabling preference specification through online vocabulary expansion

Authors

Utkarsh Soni, Nupur Thakur, Sarath Sreedharan, Lin Guan, Mudit Verma, Matthew Marquez, Subbarao Kambhampati

Abstract


There is growing interest in developing automated agents that can work alongside humans. In addition to completing the assigned task, such an agent will undoubtedly be expected to behave in a manner preferred by the human. This requires the human to communicate their preferences to the agent. To achieve this, current approaches either require users to specify a reward function or learn the preference interactively from queries that ask the user to compare behaviors. The former can be challenging if the internal representation used by the agent is inscrutable to the human, while the latter is unnecessarily cumbersome for the user if their preference can be specified more easily in symbolic terms. In this work, we propose PRESCA (PREference Specification through Concept Acquisition), a system that allows users to specify their preferences in terms of concepts that they understand. PRESCA maintains a set of such concepts in a shared vocabulary. If the relevant concept is not in the shared vocabulary, it is learned. To make learning a new concept more feedback-efficient, PRESCA leverages causal associations between the target concept and concepts that are already known. In addition, we use a novel data augmentation approach to further reduce the feedback required. We evaluate PRESCA in a Minecraft environment and show that it can effectively align the agent with the user's preference.
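The workflow the abstract describes — check whether a preference concept is already in the shared vocabulary, learn it from user feedback if not, then use it to express the preference — can be sketched as follows. This is a minimal, hypothetical illustration: all names (`SharedVocabulary`, `learn_concept`, `specify_preference`) and the memorization-based "learning" are assumptions for exposition, not the authors' actual API, which trains a classifier guided by causal links to known concepts.

```python
# Hypothetical sketch of PRESCA-style preference specification.
# Names and the toy concept learner are illustrative assumptions.
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]            # toy grid-world state: (x, y)
Concept = Callable[[State], bool]  # a concept is a binary classifier over states


class SharedVocabulary:
    """Concepts that both the agent and the user understand."""

    def __init__(self) -> None:
        self._concepts: Dict[str, Concept] = {}

    def knows(self, name: str) -> bool:
        return name in self._concepts

    def add(self, name: str, classifier: Concept) -> None:
        self._concepts[name] = classifier

    def get(self, name: str) -> Concept:
        return self._concepts[name]


def learn_concept(labeled_states: List[Tuple[State, bool]]) -> Concept:
    """Learn a (toy) concept classifier from user-labeled states.

    Here we simply memorize the positively labeled states; the paper
    instead trains a classifier, using causal associations with known
    concepts and data augmentation to reduce the feedback needed.
    """
    positives = {s for s, label in labeled_states if label}
    return lambda state: state in positives


def specify_preference(vocab: SharedVocabulary, name: str,
                       feedback: List[Tuple[State, bool]]) -> Concept:
    """Return the concept for `name`, learning it first if unknown."""
    if not vocab.knows(name):
        vocab.add(name, learn_concept(feedback))
    return vocab.get(name)


# Usage: the user prefers that the agent avoid (hypothetical) "lava" states.
vocab = SharedVocabulary()
feedback = [((0, 1), True), ((2, 2), True), ((1, 1), False)]
lava = specify_preference(vocab, "lava", feedback)
print(lava((0, 1)))  # a state the user labeled as lava -> True
print(lava((1, 1)))  # a safe state -> False
```

Once the concept is in the vocabulary, the preference (e.g., avoid states where the concept holds) can be folded into the agent's objective without the user ever touching the agent's internal representation.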
