Paper Title

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Paper Authors

Minjia Zhang, Yuxiong He

Paper Abstract

Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from unbearable overall computational expenses. Current methods for accelerating the pre-training either rely on massive parallelism with advanced hardware or are not applicable to language modeling. In this work, we propose a method based on progressive layer dropping that speeds the training of Transformer-based language models, not at the cost of excessive hardware resources but from model architecture change and training technique boosted efficiency. Extensive experiments on BERT show that the proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline to get a similar accuracy on downstream tasks. While being faster, our pre-trained models are equipped with strong knowledge transferability, achieving comparable and sometimes higher GLUE score than the baseline when pre-trained with the same number of samples.
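The core idea in the abstract, stochastically skipping Transformer layers with a keep rate that is scheduled over training and scaled by depth, can be sketched in a few lines of PyTorch. The sketch below is illustrative only and is not the paper's implementation: the exponential decay of the global keep rate toward a floor, the linear-in-depth scaling, and all names (`ProgressiveLayerDropEncoder`, `DroppableEncoderLayer`, `theta_bar`, `gamma`) are assumptions, and the architectural changes the authors pair with layer dropping are omitted.

```python
# Minimal sketch of progressive layer dropping for a Transformer encoder.
# Assumed schedule: global keep rate decays exponentially from 1.0 toward
# theta_bar; deeper layers are dropped more aggressively than shallow ones.
import math
import torch
import torch.nn as nn


class DroppableEncoderLayer(nn.Module):
    """Wraps a Transformer encoder layer so it can be skipped stochastically."""

    def __init__(self, d_model: int, nhead: int):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, x, keep_prob: float):
        # During training, skip the whole layer with probability (1 - keep_prob),
        # leaving only the identity shortcut; at evaluation, always run it.
        if self.training and torch.rand(1).item() > keep_prob:
            return x
        return self.layer(x)


class ProgressiveLayerDropEncoder(nn.Module):
    def __init__(self, num_layers=12, d_model=256, nhead=4,
                 theta_bar=0.5, gamma=1e-4):
        super().__init__()
        self.layers = nn.ModuleList(
            [DroppableEncoderLayer(d_model, nhead) for _ in range(num_layers)]
        )
        self.theta_bar = theta_bar  # floor of the global keep rate
        self.gamma = gamma          # decay speed of the schedule
        self.step = 0               # global training-step counter

    def global_keep_rate(self) -> float:
        # Starts at 1.0 (no dropping early on) and decays toward theta_bar.
        return (1.0 - self.theta_bar) * math.exp(-self.gamma * self.step) + self.theta_bar

    def forward(self, x):
        theta_t = self.global_keep_rate()
        num_layers = len(self.layers)
        for depth, layer in enumerate(self.layers, start=1):
            # Shallow layers are almost always kept; the deepest layer
            # absorbs the full (1 - theta_t) drop rate.
            keep_prob = 1.0 - (depth / num_layers) * (1.0 - theta_t)
            x = layer(x, keep_prob)
        if self.training:
            self.step += 1
        return x


if __name__ == "__main__":
    model = ProgressiveLayerDropEncoder()
    tokens = torch.randn(2, 16, 256)  # (batch, sequence, hidden)
    print(model(tokens).shape)
```

Because layers are only skipped in training mode, the full stack is always used at evaluation time; the per-sample training savings reported in the abstract come from the expected fraction of layers skipped under such a schedule.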
