Paper Title

Deep learning model compression using network sensitivity and gradients

Authors

Sakthi, Madhumitha; Yadla, Niranjan; Pawate, Raj

Abstract

Deep learning model compression is an important and rapidly developing field for the edge deployment of deep learning models. Given the increasing size of models and their corresponding power consumption, it is vital to decrease model size and compute requirements without a significant drop in model performance. In this paper, we present model compression algorithms for both non-retraining and retraining conditions. In the first case, where retraining of the model is not feasible because only an off-the-shelf model is available, without access to the original data or the necessary compute resources, we propose the Bin & Quant algorithm, which compresses deep learning models using the sensitivity of the network parameters. This results in 13x compression of a speech command-and-control model and 7x compression of the DeepSpeech2 model. In the second case, where the model can be retrained and maximum compression with negligible loss in accuracy is required, we propose our novel gradient-weighted k-means clustering algorithm (GWK). This method uses gradients to identify the important weight values in a given cluster and nudges the centroid towards those values, thereby giving importance to sensitive weights. Our method effectively combines product quantization with the EWGS [1] algorithm for sub-1-bit representation of the quantized models. We test our GWK algorithm on the CIFAR10 dataset across a range of models such as ResNet20, ResNet56, and MobileNetV2, and show 35x compression of the quantized models with less than 2% absolute loss in accuracy compared to the floating-point models.
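
The Python sketch below is only a rough illustration of the gradient-weighted clustering idea described in the abstract: weights are clustered into a small codebook, and each centroid update is weighted by gradient magnitude so that centroids are nudged toward sensitive weights. The function name, the use of gradient magnitude as the importance score, and all parameters are assumptions for illustration; this is not the authors' implementation and omits the product quantization and EWGS components.

import numpy as np

def gradient_weighted_kmeans(weights, grads, k=16, iters=20, seed=0):
    # Illustrative sketch (assumed details, not the paper's exact update rule):
    # cluster a layer's weights into k shared values, weighting each weight by
    # its gradient magnitude so centroids are pulled toward sensitive weights.
    rng = np.random.default_rng(seed)
    w = weights.ravel()
    imp = np.abs(grads.ravel()) + 1e-12  # assumed importance score per weight
    centroids = rng.choice(w, size=k, replace=False)  # init from weight values
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        # Update each centroid as the importance-weighted mean of its members,
        # i.e. nudge it toward high-gradient weights in the cluster.
        for c in range(k):
            mask = assign == c
            if mask.any():
                centroids[c] = np.average(w[mask], weights=imp[mask])
    return centroids, assign

# Usage: quantize a layer's weights to k shared values (codebook + indices).
layer_w = np.random.randn(256, 128).astype(np.float32)
layer_g = np.random.randn(256, 128).astype(np.float32)  # gradients w.r.t. layer_w
codebook, idx = gradient_weighted_kmeans(layer_w, layer_g, k=16)
quantized = codebook[idx].reshape(layer_w.shape)

Storing only the 16-entry codebook plus per-weight indices is what makes this form of weight sharing compress the model; the gradient weighting is one plausible reading of how GWK biases the centroids toward sensitive weights.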
