Paper Title
You Only Live Once: Single-Life Reinforcement Learning
Paper Authors
Paper Abstract
Reinforcement learning algorithms are typically designed to learn a performant policy that can repeatedly and autonomously complete a task, usually starting from scratch. However, in many real-world situations, the goal might not be to learn a policy that can do the task repeatedly, but simply to perform a new task successfully once in a single trial. For example, imagine a disaster relief robot tasked with retrieving an item from a fallen building, where it cannot get direct supervision from humans. It must retrieve this object within one test-time trial, and must do so while tackling unknown obstacles, though it may leverage knowledge it has of the building before the disaster. We formalize this problem setting, which we call single-life reinforcement learning (SLRL), where an agent must complete a task within a single episode without interventions, utilizing its prior experience while contending with some form of novelty. SLRL provides a natural setting to study the challenge of autonomously adapting to unfamiliar situations, and we find that algorithms designed for standard episodic reinforcement learning often struggle to recover from out-of-distribution states in this setting. Motivated by this observation, we propose an algorithm, $Q$-weighted adversarial learning (QWALE), which employs a distribution matching strategy that leverages the agent's prior experience as guidance in novel situations. Our experiments on several single-life continuous control problems indicate that methods based on our distribution matching formulation are 20-60% more successful because they can more quickly recover from novel states.
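The abstract does not specify QWALE's update rule, so the following is only a minimal sketch of one plausible reading of "$Q$-weighted adversarial learning": a GAIL-style discriminator is trained to distinguish states from the agent's prior experience (weighted by their $Q$-values) from states visited during the single online episode, and its output is used as a shaped reward that pulls the agent back toward familiar, high-value states. All class and function names below are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch (not the paper's official code): a discriminator-based
# distribution-matching reward with Q-weighted prior samples.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        # Returns a logit; large values mean "looks like prior-experience data".
        return self.net(s)

def discriminator_loss(disc, prior_states, prior_q, online_states):
    """Assumed Q-weighted adversarial loss: prior states with higher
    Q-values contribute more strongly to the 'positive' class."""
    weights = torch.softmax(prior_q, dim=0)            # Q-values -> sample weights
    logits_prior = disc(prior_states).squeeze(-1)
    logits_online = disc(online_states).squeeze(-1)
    bce = nn.functional.binary_cross_entropy_with_logits
    loss_prior = (weights * bce(logits_prior,
                                torch.ones_like(logits_prior),
                                reduction="none")).sum()
    loss_online = bce(logits_online, torch.zeros_like(logits_online)).mean()
    return loss_prior + loss_online

def shaped_reward(disc, state):
    """Reward bonus encouraging the agent to return to the prior distribution
    when it ends up in a novel, out-of-distribution state."""
    with torch.no_grad():
        return torch.log(torch.sigmoid(disc(state)) + 1e-8)
```

Under this reading, the shaped reward replaces or augments the sparse task reward during the single test-time episode, which is how the method could help the agent recover from out-of-distribution states more quickly than standard episodic RL algorithms.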