Paper Title
Parameter-Efficient and Student-Friendly Knowledge Distillation
Paper Authors
Paper Abstract
Knowledge distillation (KD) has been widely employed to transfer knowledge from a large teacher model to a smaller student, where the teacher's parameters are fixed (or partially fixed) during training. Recent studies show that this paradigm may hinder knowledge transfer due to the mismatched model capacities. To alleviate the mismatch problem, teacher-student joint training methods, e.g., online distillation, have been proposed, but they typically incur expensive computational costs. In this paper, we present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, which achieves efficient and sufficient knowledge transfer by updating only a relatively small number of parameters. Technically, we first mathematically formulate the mismatch as the sharpness gap between the teacher's and the student's predictive distributions, and show that this gap can be narrowed by appropriately smoothing the soft labels. We then introduce an adapter module for the teacher and update only the adapter to obtain soft labels with appropriate smoothness. Experiments on a variety of benchmarks show that PESF-KD significantly reduces the training cost while obtaining results competitive with advanced online distillation methods. Code will be released upon acceptance.
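
To make the adapter-based idea described in the abstract concrete, the following is a minimal sketch, not the authors' released code: the teacher backbone is kept frozen, a small residual adapter on the teacher's logits is the only teacher-side component that receives gradients, and the student is trained with the standard temperature-scaled KL distillation loss plus cross-entropy. The adapter placement (on the logits), its bottleneck size, and the names LogitAdapter, tau, and alpha are assumptions for illustration; how the adapter is actually supervised to reach the right smoothness follows the paper, so reusing the shared distillation objective as its training signal here is also an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitAdapter(nn.Module):
    # Small bottleneck adapter applied to the teacher's logits (hypothetical placement).
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.down = nn.Linear(num_classes, hidden)
        self.up = nn.Linear(hidden, num_classes)

    def forward(self, logits):
        # Residual form keeps the adapted logits close to the frozen teacher's output.
        return logits + self.up(F.relu(self.down(logits)))

def distillation_step(teacher, adapter, student, optimizer, x, y, tau=4.0, alpha=0.9):
    # One joint update: gradients reach only the student and the adapter.
    with torch.no_grad():                       # teacher backbone stays frozen
        t_logits = teacher(x)
    t_soft = F.log_softmax(adapter(t_logits) / tau, dim=-1)   # adapted (smoothed) soft labels
    s_logits = student(x)
    s_soft = F.log_softmax(s_logits / tau, dim=-1)

    kd_loss = F.kl_div(s_soft, t_soft, reduction="batchmean", log_target=True) * tau * tau
    ce_loss = F.cross_entropy(s_logits, y)
    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In this sketch the optimizer would be built over both trainable parts, e.g. torch.optim.SGD(list(student.parameters()) + list(adapter.parameters()), lr=0.05), so the parameter-efficient aspect comes from the adapter being far smaller than the full teacher; the specific loss weights and temperature are placeholders rather than values from the paper.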