Paper Title
Prompting Decision Transformer for Few-Shot Policy Generalization
Paper Authors
Paper Abstract
Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architectural inductive bias on the few-shot learning capability. We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design the trajectory prompt, which contains segments of the few-shot demonstrations and encodes task-specific information to guide policy generation. Our experiments on five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra finetuning on unseen target tasks. Prompt-DT outperforms its variants and strong offline meta-RL baselines by a large margin with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to changes in prompt length and can generalize to out-of-distribution (OOD) environments.
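To make the token layout concrete, below is a minimal sketch of how a trajectory prompt could be prepended to the agent's recent history before being fed to a Decision Transformer-style model. The function name `build_prompt_dt_tokens`, the dictionary keys `obs`/`act`/`rew`, and the specific array shapes are illustrative assumptions, not the authors' released code; the (return-to-go, state, action) triple format follows the standard Decision Transformer input convention.

```python
import numpy as np

def returns_to_go(rewards):
    # rtg[t] = sum of rewards from timestep t to the end of the trajectory.
    return np.cumsum(rewards[::-1])[::-1]

def build_prompt_dt_tokens(demo, history, prompt_len=5, context_len=20):
    """Concatenate a trajectory prompt with the recent rollout history.

    `demo` and `history` are dicts of aligned NumPy arrays keyed by
    "obs", "act", and "rew" (names are illustrative). Each timestep
    contributes a (return-to-go, observation, action) triple.
    """
    def to_triples(traj, start, stop):
        rtg = returns_to_go(traj["rew"])
        return [(rtg[t], traj["obs"][t], traj["act"][t])
                for t in range(start, stop)]

    # Trajectory prompt: the first few timesteps of one few-shot
    # demonstration, encoding task-specific information.
    prompt = to_triples(demo, 0, min(prompt_len, len(demo["rew"])))

    # Recent context: the last `context_len` timesteps of the rollout.
    n = len(history["rew"])
    recent = to_triples(history, max(0, n - context_len), n)

    # The Transformer consumes prompt tokens followed by history tokens
    # and autoregressively predicts the next action; since the prompt
    # alone carries the task identity, no gradient update (finetuning)
    # is needed on the unseen target task.
    return prompt + recent

# Toy usage: 2-D observations, 1-D actions, scalar rewards.
rng = np.random.default_rng(0)
demo = {"obs": rng.normal(size=(10, 2)),
        "act": rng.normal(size=(10, 1)),
        "rew": rng.normal(size=10)}
hist = {"obs": rng.normal(size=(30, 2)),
        "act": rng.normal(size=(30, 1)),
        "rew": rng.normal(size=30)}
tokens = build_prompt_dt_tokens(demo, hist)
print(len(tokens))  # 5 prompt triples + 20 history triples = 25
```

Under these assumptions, swapping in a demonstration segment from a different task changes only the prompt tokens, which is what allows a single trained model to adapt its generated policy without any parameter updates.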