Paper Title
Prompting Decision Transformer for Few-Shot Policy Generalization
Paper Authors
Paper Abstract
Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architectural inductive bias on the few-shot learning capability. We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design the trajectory prompt, which contains segments of the few-shot demonstrations and encodes task-specific information to guide policy generation. Our experiments on five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra finetuning on unseen target tasks. Prompt-DT outperforms its variants and strong offline meta-RL baselines by a large margin with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to changes in prompt length and can generalize to out-of-distribution (OOD) environments.
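To make the token layout concrete, below is a minimal sketch of how a trajectory prompt could be prepended to the agent's recent history before being fed to a Decision Transformer-style model. The function name `build_prompt_dt_tokens`, the dictionary keys `obs`/`act`/`rew`, and the specific array shapes are illustrative assumptions, not the authors' released code; the (return-to-go, state, action) triple format follows the standard Decision Transformer input convention.

```python
import numpy as np

def returns_to_go(rewards):
    # rtg[t] = sum of rewards from timestep t to the end of the trajectory.
    return np.cumsum(rewards[::-1])[::-1]

def build_prompt_dt_tokens(demo, history, prompt_len=5, context_len=20):
    """Concatenate a trajectory prompt with the recent rollout history.

    `demo` and `history` are dicts of aligned NumPy arrays keyed by
    "obs", "act", and "rew" (names are illustrative). Each timestep
    contributes a (return-to-go, observation, action) triple.
    """
    def to_triples(traj, start, stop):
        rtg = returns_to_go(traj["rew"])
        return [(rtg[t], traj["obs"][t], traj["act"][t])
                for t in range(start, stop)]

    # Trajectory prompt: the first few timesteps of one few-shot
    # demonstration, encoding task-specific information.
    prompt = to_triples(demo, 0, min(prompt_len, len(demo["rew"])))

    # Recent context: the last `context_len` timesteps of the rollout.
    n = len(history["rew"])
    recent = to_triples(history, max(0, n - context_len), n)

    # The Transformer consumes prompt tokens followed by history tokens
    # and autoregressively predicts the next action; since the prompt
    # alone carries the task identity, no gradient update (finetuning)
    # is needed on the unseen target task.
    return prompt + recent

# Toy usage: 2-D observations, 1-D actions, scalar rewards.
rng = np.random.default_rng(0)
demo = {"obs": rng.normal(size=(10, 2)),
        "act": rng.normal(size=(10, 1)),
        "rew": rng.normal(size=10)}
hist = {"obs": rng.normal(size=(30, 2)),
        "act": rng.normal(size=(30, 1)),
        "rew": rng.normal(size=30)}
tokens = build_prompt_dt_tokens(demo, hist)
print(len(tokens))  # 5 prompt triples + 20 history triples = 25
```

Under these assumptions, swapping in a demonstration segment from a different task changes only the prompt tokens, which is what allows a single trained model to adapt its generated policy without any parameter updates.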