Paper Title

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Paper Authors

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

Paper Abstract

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in a variety of NLP tasks. However, these models usually consist of hundreds of millions of parameters, which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks while using 50% of the Transformer parameters and computations of the teacher model. We also obtain competitive results when applying deep self-attention distillation to multilingual pre-trained models.
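To make the method concrete, below is a minimal PyTorch sketch of the training objective the abstract describes: a KL-divergence loss between the teacher's and student's attention distributions (queries x keys) plus a KL-divergence loss between their value relations (values x values), both taken from the last Transformer layer. The function names, the (batch, heads, seq_len, head_dim) tensor layout, and the averaging scheme are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import math
import torch
import torch.nn.functional as F


def self_attention_relations(q, k, v):
    """Compute the two relation matrices distilled from one attention layer.

    q, k, v: (batch, num_heads, seq_len, head_dim) projections of the
    last Transformer layer. head_dim may differ between teacher and
    student; both relation matrices are (batch, num_heads, seq_len, seq_len).
    """
    d = q.size(-1)
    # Attention logits: scaled dot-product of queries and keys.
    attn = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d)
    # Value-relation logits: scaled dot-product between values.
    val_rel = torch.matmul(v, v.transpose(-1, -2)) / math.sqrt(d)
    return attn, val_rel


def kl_per_head(teacher_logits, student_logits):
    """KL(teacher || student), averaged over batch, heads, and query positions."""
    t = F.softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    loss = F.kl_div(s, t, reduction="sum")
    return loss / (t.size(0) * t.size(1) * t.size(2))


def deep_self_attention_distillation_loss(teacher_qkv, student_qkv):
    """L = attention-transfer loss + value-relation transfer loss."""
    attn_t, vr_t = self_attention_relations(*teacher_qkv)
    attn_s, vr_s = self_attention_relations(*student_qkv)
    return kl_per_head(attn_t, attn_s) + kl_per_head(vr_t, vr_s)
```

Because both relation matrices are seq_len x seq_len regardless of hidden size, the student's hidden dimension does not have to match the teacher's, which is one reason the abstract calls the transfer flexible; only the last layer's self-attention module is matched, so the student's depth is also unconstrained.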
