Paper Title
Autonomous Open-Ended Learning of Tasks with Non-Stationary Interdependencies
Authors
Abstract
Autonomous open-ended learning is a relevant approach in machine learning and robotics, allowing the design of artificial agents able to acquire goals and motor skills without the need for user-assigned tasks. A crucial issue for this approach is to develop strategies that ensure agents can maximise their competence on as many tasks as possible in the shortest possible time. Intrinsic motivations have been shown to generate a task-agnostic signal that properly allocates training time amongst goals. While the majority of works in the field of intrinsically motivated open-ended learning focus on scenarios where goals are independent of each other, only a few have studied the autonomous acquisition of interdependent tasks, and even fewer have tackled scenarios where goals involve non-stationary interdependencies. Building on previous work, we tackle these crucial issues at the level of decision making (i.e., building strategies to properly select between goals), and we propose a hierarchical architecture that, by treating sub-task selection as a Markov Decision Process, is able to properly learn interdependent skills on the basis of intrinsically generated motivations. In particular, we first deepen the analysis of a previous system, showing the importance of incorporating information about the relationships between tasks at a higher level of the architecture (that of goal selection). We then introduce H-GRAIL, a new system that extends the previous one by adding a new learning layer that stores the autonomously acquired sequences of tasks, so that they can be modified when the interdependencies are non-stationary. All systems are tested in a real robotic scenario, with a Baxter robot performing multiple interdependent reaching tasks.