Paper Title
Retrieval-Augmented Reinforcement Learning
Paper Authors
Paper Abstract
Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper, we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. The proposed method facilitates learning agents that, at test time, can condition their behavior on the entire dataset and not only the current state or current trajectory. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.
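
To make the idea concrete, below is a minimal sketch (in PyTorch) of a retrieval-augmented Q-network in the spirit described above: the current state is encoded into a query, the query softly attends over a dataset of pre-encoded past experiences, and the retrieved summary is concatenated with the state encoding before predicting Q-values. The module names, dimensions, and the attention-based retrieval mechanism are illustrative assumptions, not the paper's exact architecture.

# Minimal sketch (assumed, not the paper's exact method) of a retrieval-augmented
# Q-network: encode the state into a query, attend over a dataset of pre-encoded
# experiences, and condition the Q-head on both the state and the retrieved summary.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalAugmentedQNetwork(nn.Module):
    def __init__(self, state_dim: int, key_dim: int, num_actions: int):
        super().__init__()
        self.state_encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, key_dim)
        )
        # Learned retrieval process: maps the encoded state to a query vector.
        self.query_net = nn.Linear(key_dim, key_dim)
        # Q-head conditions on the state encoding and the retrieved summary.
        self.q_head = nn.Sequential(
            nn.Linear(2 * key_dim, 128), nn.ReLU(), nn.Linear(128, num_actions)
        )

    def forward(self, state, dataset_keys, dataset_values):
        # state:          [B, state_dim]  current observations
        # dataset_keys:   [N, key_dim]    pre-encoded past experiences (keys)
        # dataset_values: [N, key_dim]    information attached to each experience
        h = self.state_encoder(state)                                     # [B, key_dim]
        query = self.query_net(h)                                         # [B, key_dim]
        # Soft retrieval: scaled dot-product attention over the experience dataset.
        scores = query @ dataset_keys.t() / dataset_keys.shape[1] ** 0.5  # [B, N]
        weights = F.softmax(scores, dim=-1)
        retrieved = weights @ dataset_values                              # [B, key_dim]
        return self.q_head(torch.cat([h, retrieved], dim=-1))             # [B, num_actions]


if __name__ == "__main__":
    net = RetrievalAugmentedQNetwork(state_dim=8, key_dim=32, num_actions=4)
    states = torch.randn(16, 8)
    keys, values = torch.randn(1000, 32), torch.randn(1000, 32)
    q_values = net(states, keys, values)
    print(q_values.shape)  # torch.Size([16, 4])

Because the retrieval weights are differentiable, such a network can be trained end to end with a standard TD loss, so the agent learns which past experiences are worth retrieving for the current context.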