Paper title
Trying AGAIN instead of Trying Longer: Prior Learning for Automatic Curriculum Learning
Paper authors
Abstract
A major challenge in the Deep RL (DRL) community is to train agents able to generalize over unseen situations, which is often approached by training them on a diversity of tasks (or environments). A powerful method to foster diversity is to procedurally generate tasks by sampling their parameters from a multi-dimensional distribution, which in particular makes it possible to propose a different task for each training episode. In practice, to get the high diversity of training tasks necessary for generalization, one has to use complex procedural generation systems. With such generators, it is hard to obtain prior knowledge on the subset of tasks that are actually learnable at all (many generated tasks may be unlearnable), on their relative difficulty, and on the most efficient task-distribution ordering for training. A typical solution in such cases is to rely on some form of Automatic Curriculum Learning (ACL) to adapt the sampling distribution. One limit of current approaches is their need to explore the task space to detect progress niches over time, which leads to a loss of time. Additionally, we hypothesize that the noise this exploration induces in the training data may impair the performance of brittle DRL learners. We address this problem by proposing a two-stage ACL approach where 1) a teacher algorithm first learns to train a DRL agent with a high-exploration curriculum, and then 2) distills learned priors from the first run to generate an "expert curriculum" to re-train the same agent from scratch. Besides demonstrating a 50% improvement on average over the current state of the art, the objective of this work is to give a first example of a new research direction oriented towards refining ACL techniques over multiple learners, which we call Classroom Teaching.
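To make the two-stage idea concrete, here is a minimal sketch of the procedure the abstract describes: a first, high-exploration run in which a teacher samples task parameters broadly and records learning progress, followed by a distillation step that keeps only the regions of task space where the first learner actually improved, yielding an "expert curriculum" for a fresh second run. All names (`ExploratoryTeacher`, `distill_curriculum`, `sample_expert_task`), the bin-based discretization, and the crude progress test are illustrative assumptions, not the paper's actual algorithm.

```python
import random

class ExploratoryTeacher:
    """Stage 1 (hypothetical sketch): sample task parameters broadly and
    record episode returns per region of a discretized task space."""

    def __init__(self, task_dim, n_bins=4):
        self.task_dim = task_dim
        self.n_bins = n_bins
        self.progress = {}  # bin key -> list of episode returns

    def sample_task(self):
        # High-exploration curriculum: uniform over the task-parameter space.
        return tuple(random.random() for _ in range(self.task_dim))

    def update(self, task, episode_return):
        # Bucket the task's parameters and log the return observed there.
        bin_key = tuple(int(t * self.n_bins) for t in task)
        self.progress.setdefault(bin_key, []).append(episode_return)


def distill_curriculum(teacher):
    """Stage 2 prior: keep only bins where the first learner made progress
    (a crude 'progress niche' test; many generated tasks may be unlearnable)."""
    return [bin_key for bin_key, returns in teacher.progress.items()
            if returns[-1] > returns[0]]


def sample_expert_task(expert_bins, n_bins=4):
    """Expert curriculum for the re-trained agent: sample uniformly inside a
    bin known from run 1 to be learnable."""
    bin_key = random.choice(expert_bins)
    return tuple((b + random.random()) / n_bins for b in bin_key)
```

In a full implementation the progress test would be replaced by a proper learning-progress estimator and the second agent would be re-initialized from scratch, as the abstract specifies; the sketch only shows how priors from run 1 can reshape the sampling distribution for run 2.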