Paper Title
Distilling Knowledge by Mimicking Features
Paper Authors
Paper Abstract
Knowledge distillation (KD) is a popular method to train efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logits as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer. Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer. Experiments show that it achieves higher accuracy than traditional KD. To further facilitate feature mimicking, we decompose a feature vector into its magnitude and its direction. We argue that the teacher should give the student more freedom in the feature magnitude and let the student pay more attention to mimicking the feature direction. To meet this requirement, we propose a loss term based on locality-sensitive hashing (LSH). With the help of this new loss, our method indeed mimics feature directions more accurately, relaxes constraints on feature magnitudes, and achieves state-of-the-art distillation accuracy. We provide theoretical analyses of how LSH facilitates feature direction mimicking, and further extend feature mimicking to multi-label recognition and object detection.
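To make the direction-oriented loss described in the abstract concrete, the sketch below shows one way an LSH-style term could be written in PyTorch: fixed random hyperplanes hash the teacher's penultimate-layer feature into binary codes, and the student is trained to reproduce those codes with a binary cross-entropy, which constrains the feature direction while leaving its magnitude largely free. This is a minimal illustration under our own assumptions; the class name LSHDirectionLoss, the number of hash functions, and the optional pairing with a plain L2 mimicking term are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSHDirectionLoss(nn.Module):
    """Illustrative LSH-style direction-mimicking loss (a sketch, not the
    paper's exact loss). Random hyperplanes hash the teacher feature into
    binary codes; the student is pushed to fall on the same side of each
    hyperplane, which matches the feature direction but not its magnitude."""

    def __init__(self, feat_dim, num_hashes=2048):
        super().__init__()
        # Fixed random projection matrix (the LSH hyperplanes); stored as a
        # buffer so it is not updated by the optimizer.
        self.register_buffer("W", torch.randn(num_hashes, feat_dim))

    def forward(self, f_student, f_teacher):
        # Teacher's binary hash codes: which side of each hyperplane.
        with torch.no_grad():
            target = (f_teacher @ self.W.t() > 0).float()
        # Student's logits against the same hyperplanes.
        logits = f_student @ self.W.t()
        # BCE rewards sign agreement, i.e. agreement in feature direction.
        return F.binary_cross_entropy_with_logits(logits, target)


# Hypothetical usage: f_s and f_t are penultimate-layer features of shape
# (batch, feat_dim); the LSH term can be combined with an ordinary L2
# feature-mimicking term if desired.
if __name__ == "__main__":
    f_s, f_t = torch.randn(8, 512), torch.randn(8, 512)
    lsh_loss = LSHDirectionLoss(feat_dim=512)
    loss = lsh_loss(f_s, f_t) + 0.1 * F.mse_loss(f_s, f_t)
    print(loss.item())
```

Because the targets depend only on the sign of the teacher projections, scaling a feature vector does not change its hash code, which is how this kind of loss relaxes the magnitude constraint while enforcing direction agreement.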