Paper Title

PENNI: Pruned Kernel Sharing for Efficient CNN Inference

Authors

Shiyu Li, Edward Hanson, Hai Li, Yiran Chen

Abstract

Although state-of-the-art (SOTA) CNNs achieve outstanding performance on various tasks, their high computation demand and massive number of parameters make it difficult to deploy these SOTA CNNs onto resource-constrained devices. Previous works on CNN acceleration utilize low-rank approximation of the original convolution layers to reduce computation cost. However, these methods are very difficult to conduct on sparse models, which limits execution speedup, since the redundancies within the CNN model are not fully exploited. We argue that kernel-granularity decomposition can be conducted under a low-rank assumption while the redundancy within the remaining compact coefficients is exploited. Based on this observation, we propose PENNI, a CNN model compression framework that achieves model compactness and hardware efficiency simultaneously by (1) implementing kernel sharing in convolution layers via a small number of basis kernels and (2) alternately adjusting the bases and coefficients with sparse constraints. Experiments show that we can prune 97% of the parameters and 92% of the FLOPs of ResNet18 on CIFAR10 with no accuracy loss, and achieve a 44% reduction in run-time memory consumption and a 53% reduction in inference latency.
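The two mechanisms named in the abstract, kernel sharing through a small set of basis kernels and alternating sparse adjustment of bases and coefficients, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch version, not the authors' released code: names such as BasisConv2d, num_basis, and l1_on_coef are illustrative. Each k x k kernel is flattened into a row, the stack of kernels is factorized with a truncated SVD, and the layer then stores r shared basis kernels plus a per-kernel coefficient vector; an L1 penalty on the coefficients stands in for the sparsity constraint used during retraining.

```python
# Minimal sketch of kernel-granularity decomposition in the spirit of PENNI.
# Assumes PyTorch; all names here are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasisConv2d(nn.Module):
    """Conv layer whose c_out * c_in kernels are linear combinations of a
    small shared set of basis kernels (kernel sharing)."""

    def __init__(self, conv: nn.Conv2d, num_basis: int):
        super().__init__()
        c_out, c_in, k, _ = conv.weight.shape
        self.shape = (c_out, c_in, k, k)
        self.stride, self.padding = conv.stride, conv.padding
        # Flatten every k x k kernel into a row, then take a rank-r SVD.
        W = conv.weight.detach().reshape(c_out * c_in, k * k)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        r = num_basis
        # Coefficients: one r-vector per kernel; basis: r shared k*k kernels.
        self.coef = nn.Parameter((U[:, :r] * S[:r]).clone())  # (c_out*c_in, r)
        self.basis = nn.Parameter(Vh[:r].clone())             # (r, k*k)
        self.bias = conv.bias

    def forward(self, x):
        c_out, c_in, k, _ = self.shape
        # Reconstruct the full weight from shared bases and per-kernel coefs.
        W = (self.coef @ self.basis).reshape(c_out, c_in, k, k)
        return F.conv2d(x, W, self.bias, self.stride, self.padding)

    def l1_on_coef(self):
        # Sparsity-inducing penalty on the coefficients; rows driven to zero
        # mark kernels that can be pruned after retraining.
        return self.coef.abs().sum()

# Usage sketch: decompose a layer, then add the L1 term to the task loss.
layer = BasisConv2d(nn.Conv2d(64, 128, 3, padding=1), num_basis=5)
y = layer(torch.randn(1, 64, 32, 32))
loss = y.square().mean() + 1e-4 * layer.l1_on_coef()
```

Under this reading, the alternating stage would freeze `basis` and update `coef` under the L1 term, then freeze `coef` and fine-tune `basis`, repeating until convergence; near-zero coefficient rows then identify kernels (and, when whole channels go to zero, filters) that can be removed, which is one plausible route to the parameter and FLOP reductions the abstract reports.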
