Paper Title
Channel Planting for Deep Neural Networks using Knowledge Distillation
Paper Authors
Paper Abstract
In recent years, deeper and wider neural networks have shown excellent performance in computer vision tasks, but their enormous number of parameters results in increased computational cost and overfitting. Several methods have been proposed to compress the size of networks without reducing their performance. Network pruning removes redundant and unnecessary parameters from a network. Knowledge distillation transfers the knowledge of deeper and wider networks to smaller networks. However, the performance of the smaller networks obtained by these methods is bounded by the predefined network architecture. Neural architecture search has been proposed to automatically search network architectures and break this structural limitation. There is also a dynamic configuration method that trains networks incrementally as sub-networks. In this paper, we present a novel incremental training algorithm for deep neural networks called planting. Planting searches for an optimal network architecture with a smaller number of parameters by incrementally adding channels to the layers of an initial network while keeping the earlier-trained parameters fixed, thereby improving network performance. In addition, we propose training the planted channels with knowledge distillation. By transferring the knowledge of deeper and wider networks, we can grow networks effectively and efficiently. We evaluate the effectiveness of the proposed method on several datasets, including CIFAR-10/100 and STL-10. On the STL-10 dataset, we show that we can achieve comparable performance with only 7% of the parameters of the larger network while reducing the overfitting caused by the small amount of data.
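To make the procedure described in the abstract concrete, below is a minimal PyTorch sketch of the two ingredients it mentions: growing a convolutional layer by planting extra output channels while keeping the earlier-trained filters fixed, and a standard soft-target distillation loss for training the newly planted channels against a larger teacher. The names (`plant_channels`, `distillation_loss`) and the temperature value are illustrative assumptions, not the authors' implementation, and widening the downstream layers that consume the new channels is omitted for brevity.

```python
# Minimal sketch of channel planting with knowledge distillation (assumed names).
import torch
import torch.nn as nn
import torch.nn.functional as F

def plant_channels(conv: nn.Conv2d, extra_out: int) -> nn.Conv2d:
    """Return a wider conv layer with `extra_out` additional output channels.
    The original filters are copied over and kept fixed by zeroing their
    gradients, so only the newly planted channels are trained."""
    new_conv = nn.Conv2d(conv.in_channels,
                         conv.out_channels + extra_out,
                         conv.kernel_size,
                         stride=conv.stride,
                         padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight[:conv.out_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias[:conv.out_channels] = conv.bias

    old = conv.out_channels
    # Freeze the earlier-trained filters: zero their gradient slice on every backward pass.
    new_conv.weight.register_hook(
        lambda g: torch.cat([torch.zeros_like(g[:old]), g[old:]], dim=0))
    if new_conv.bias is not None:
        new_conv.bias.register_hook(
            lambda g: torch.cat([torch.zeros_like(g[:old]), g[old:]], dim=0))
    return new_conv

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Soft-target distillation loss: KL divergence between the
    temperature-softened teacher and student output distributions."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
```

In this sketch, the widened layer replaces the original one in the student network; because the earlier filters receive zero gradient, previously learned knowledge is preserved while the added capacity is fit to the teacher's soft targets.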