Paper Title
Optimizing Deeper Transformers on Small Datasets
Paper Authors
Paper Abstract
It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train $48$ layers of transformers, comprising $24$ fine-tuned layers from pre-trained RoBERTa and $24$ relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain the state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
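The abstract names DT-Fixup but does not spell out its update rule, only that it rescales the newly added (non-pre-trained) transformer layers in the spirit of T-Fixup. The sketch below is purely illustrative and is not the paper's exact scheme: the depth-dependent factor `n ** -0.5`, the choice of which projections to rescale, and the use of `nn.TransformerEncoderLayer` as a stand-in for the relation-aware layers are all assumptions.

```python
# Illustrative sketch only: a Fixup-style, depth-dependent rescaling of newly
# added transformer layers stacked on top of a pre-trained encoder. The exact
# DT-Fixup factor in the paper is data-dependent; the n**-0.5 scale below is
# an assumption for illustration.
import torch
import torch.nn as nn


def fixup_style_rescale(new_layers: nn.ModuleList) -> None:
    """Rescale the output projections of freshly initialized layers.

    `new_layers` is assumed to hold `nn.TransformerEncoderLayer` modules
    trained from scratch on top of a pre-trained encoder (e.g. RoBERTa),
    whose own 24 layers are left untouched.
    """
    n = len(new_layers)
    scale = n ** -0.5  # depth-dependent factor (illustrative choice)
    with torch.no_grad():
        for layer in new_layers:
            # shrink the attention output projection and the second
            # feed-forward projection so deep stacks start near-identity
            layer.self_attn.out_proj.weight.mul_(scale)
            layer.linear2.weight.mul_(scale)


# Usage: 24 extra layers trained from scratch, mirroring the 24+24 setup
new_stack = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
    for _ in range(24)
)
fixup_style_rescale(new_stack)
```

The design intent this sketch mirrors is that shrinking the residual branches of the from-scratch layers keeps early updates small, which is what lets very deep stacks train stably on small datasets without warmup or task-specific pre-training.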