Paper Title

On the Demystification of Knowledge Distillation: A Residual Network Perspective

Paper Authors

Nandan Kumar Jha, Rajat Saini, Sparsh Mittal

Paper Abstract

Knowledge distillation (KD) is generally considered a technique for performing model compression and learned label smoothing. However, in this paper, we investigate the KD approach from a new perspective: we study its efficacy in training a deeper network without any residual connections. We find that in most cases, non-residual student networks perform equally well as, or better than, their residual versions trained on raw data without KD (baseline networks). Surprisingly, in some cases they surpass the accuracy of the baseline networks even with inferior teachers. Beyond a certain depth of the non-residual student network, the accuracy drop caused by removing residual connections is substantial, and training with KD boosts the student's accuracy to a great extent; however, it does not fully recover the accuracy drop. Furthermore, we observe that the conventional teacher-student view of KD is incomplete and does not adequately explain our findings. We propose a novel interpretation of KD with the Trainee-Mentor hypothesis, which provides a holistic view of KD. We also present two viewpoints, loss landscape and feature reuse, to explain the interplay between residual connections and KD. We substantiate our claims through extensive experiments on residual networks.
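
For readers unfamiliar with the KD objective referenced in the abstract, below is a minimal sketch of the standard Hinton-style distillation loss (temperature-softened teacher targets combined with ordinary cross-entropy), which is the typical form of KD training discussed in this line of work. The function name, the temperature `T`, and the weighting `alpha` are illustrative assumptions, not hyperparameters taken from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft-target term: KL divergence between the temperature-softened
    # teacher and student distributions, scaled by T^2 as is conventional.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Weighted combination; alpha controls how much the student follows the teacher.
    return alpha * soft + (1.0 - alpha) * hard
```

In the setting the abstract describes, the student would be a plain (non-residual) network and the teacher a residual one; the loss above is applied to the student's logits during training.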
