Paper Title
Active Task-Inference-Guided Deep Inverse Reinforcement Learning
Paper Authors
Paper Abstract
We consider the problem of reward learning for temporally extended tasks. For reward learning, inverse reinforcement learning (IRL) is a widely used paradigm. Given a Markov decision process (MDP) and a set of demonstrations for a task, IRL learns a reward function that assigns a real-valued reward to each state of the MDP. However, for temporally extended tasks, the underlying reward function may not be expressible as a function of individual states of the MDP. Instead, the history of visited states may need to be considered to determine the reward at the current state. To address this issue, we propose an iterative algorithm to learn a reward function for temporally extended tasks. At each iteration, the algorithm alternates between two modules, a task inference module that infers the underlying task structure and a reward learning module that uses the inferred task structure to learn a reward function. The task inference module produces a series of queries, where each query is a sequence of subgoals. The demonstrator provides a binary response to each query by attempting to execute it in the environment and observing the environment's feedback. After the queries are answered, the task inference module returns an automaton encoding its current hypothesis of the task structure. The reward learning module augments the state space of the MDP with the states of the automaton. The module then proceeds to learn a reward function over the augmented state space using a novel deep maximum entropy IRL algorithm. This iterative process continues until it learns a reward function with satisfactory performance. The experiments show that the proposed algorithm significantly outperforms several IRL baselines on temporally extended tasks.
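To make the alternation between the two modules concrete, the sketch below outlines one way the loop and the automaton-augmented ("product") state space could be organized. It is a minimal illustration only, not the paper's implementation; every interface in it (ProductMDP, infer_automaton, deep_maxent_irl, evaluate, the labeling function, and the performance threshold) is an assumed placeholder.

```python
# Illustrative sketch (assumed interfaces, not the authors' code) of the loop in the
# abstract: task inference returns an automaton hypothesis, then reward learning runs
# on the MDP state space augmented with the automaton's states.

from dataclasses import dataclass

@dataclass(frozen=True)
class ProductState:
    """A state of the augmented state space: (MDP state, automaton state)."""
    mdp_state: int
    automaton_state: int

class ProductMDP:
    """Wraps an MDP with a DFA hypothesis so learned rewards can depend on history."""

    def __init__(self, mdp, automaton, labeling):
        self.mdp = mdp              # raw environment dynamics (placeholder interface)
        self.automaton = automaton  # DFA encoding the inferred task structure
        self.labeling = labeling    # maps an MDP state to the subgoal it satisfies (or None)

    def initial_state(self):
        return ProductState(self.mdp.initial_state(), self.automaton.initial_state)

    def step(self, state, action):
        next_s = self.mdp.step(state.mdp_state, action)
        # Advance the automaton with the label of the newly visited state, so a reward
        # defined over ProductState implicitly depends on the history of visited states.
        next_q = self.automaton.step(state.automaton_state, self.labeling(next_s))
        return ProductState(next_s, next_q)

def task_inference_guided_irl(mdp, labeling, demos, demonstrator,
                              infer_automaton, deep_maxent_irl, evaluate,
                              threshold, max_iters=10):
    """Alternate task inference and reward learning until performance is satisfactory."""
    reward_fn, automaton = None, None
    for _ in range(max_iters):
        # Task inference: pose subgoal-sequence queries; the demonstrator answers each
        # with a binary success/failure signal obtained by executing it in the environment.
        automaton = infer_automaton(query_oracle=demonstrator)
        # Reward learning: deep maximum-entropy IRL over the augmented state space.
        product = ProductMDP(mdp, automaton, labeling)
        reward_fn = deep_maxent_irl(product, demos)
        if evaluate(product, reward_fn) >= threshold:
            break
    return reward_fn, automaton
```

The key structural point the sketch tries to show is that pairing each MDP state with an automaton state lets a memoryless reward over the product space express a history-dependent reward over the original MDP.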