Paper Title
Gradient Knowledge Distillation for Pre-trained Language Models
Paper Authors
Paper Abstract
Knowledge distillation (KD) is an effective framework for transferring knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning instance-wise outputs between the teacher and the student, while neglecting an important knowledge source, i.e., the gradient of the teacher. The gradient characterizes how the teacher responds to changes in its inputs, which we assume is beneficial for the student to better approximate the teacher's underlying mapping function. We therefore propose Gradient Knowledge Distillation (GKD), which incorporates a gradient alignment objective into the distillation process. Experimental results show that GKD outperforms previous KD methods in terms of student performance. Further analysis shows that incorporating gradient knowledge makes the student behave more consistently with the teacher, greatly improving interpretability.
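The abstract does not spell out the exact form of the gradient alignment objective, so the following is only a minimal PyTorch sketch of one plausible instantiation: standard logit distillation plus an MSE penalty between the teacher's and student's gradients with respect to shared input embeddings. The function name, hyperparameters, HuggingFace-style interface (`inputs_embeds` argument, `.logits` attribute), and choice of MSE as the alignment distance are all assumptions for illustration and may not match the paper.

```python
import torch
import torch.nn.functional as F

def gkd_loss_sketch(student, teacher, input_embeds, labels,
                    temperature=2.0, alpha=0.5, beta=0.1):
    """Hypothetical GKD-style objective: hard-label loss + logit distillation
    + a gradient-alignment term computed w.r.t. shared input embeddings."""
    # Treat the embeddings as a leaf tensor we can differentiate against.
    input_embeds = input_embeds.detach().requires_grad_(True)

    # Teacher pass: obtain its logits and its gradient w.r.t. the inputs.
    teacher_logits = teacher(inputs_embeds=input_embeds).logits
    teacher_grad, = torch.autograd.grad(
        F.cross_entropy(teacher_logits, labels), input_embeds
    )

    # Student pass on the same embeddings.
    student_logits = student(inputs_embeds=input_embeds).logits
    ce = F.cross_entropy(student_logits, labels)

    # Standard response-based KD: KL between temperature-softened distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Gradient alignment: push the student's input gradient toward the teacher's.
    # create_graph=True so this term is differentiable w.r.t. student parameters.
    student_grad, = torch.autograd.grad(ce, input_embeds, create_graph=True)
    grad_align = F.mse_loss(student_grad, teacher_grad.detach())

    return ce + alpha * kd + beta * grad_align
```

Note that aligning gradients requires differentiating through the student's own input gradient (hence `create_graph=True`), which roughly doubles the backward cost of each training step compared with output-only distillation.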