Paper Title
What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
Paper Authors
Paper Abstract
Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are available at https://github.com/bigscience-workshop/architecture-objective.
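The three architectures compared in the abstract differ chiefly in which positions each token is allowed to attend to. As a rough illustration only (not taken from the paper's released code), the sketch below constructs boolean attention masks for a causal decoder and a non-causal (prefix-LM) decoder; the helper names `causal_mask` and `non_causal_mask` are hypothetical.

```python
# Illustrative sketch of token-visibility patterns, assuming a boolean mask
# convention where mask[i, j] == True means "position i may attend to j".
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Causal decoder: each token attends to itself and all earlier tokens."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def non_causal_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Non-causal (prefix-LM) decoder: tokens within the input prefix attend
    bidirectionally to each other; tokens after the prefix remain causal."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    mask[:prefix_len, :prefix_len] = True  # full visibility over the prefix
    return mask


if __name__ == "__main__":
    print(causal_mask(4).astype(int))                  # lower-triangular mask
    print(non_causal_mask(4, prefix_len=2).astype(int))  # bidirectional prefix
```

An encoder-decoder covers the same idea with separate stacks: the encoder is fully bidirectional over the input, while the decoder is causal over the output and cross-attends to the encoder states.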