Paper Title


TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

Authors

Lizhi Xiang, Miao Yin, Chengming Zhang, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan, Dingwen Tao

Abstract


Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation on five modern CNNs on an A100 GPU demonstrates that our compressed models with our optimized code achieve up to 2.21X speedup over cuDNN, 1.12X speedup over TVM, and 3.27X speedup over the original models using cuDNN, with at most 0.05% accuracy loss.
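To illustrate the FLOPs-vs-latency gap the abstract refers to: in the standard Tucker-2 format, a dense KxK convolution (C_in → C_out channels) is replaced by a 1x1 channel-reducing convolution, a small KxK "core" convolution between the two Tucker ranks, and a 1x1 channel-expanding convolution. The sketch below is a minimal FLOPs calculator for this structure, not the paper's performance model; the layer shape and ranks in the usage example are illustrative values, not taken from the paper.

```python
def conv_flops(c_in, c_out, k, h, w):
    # Multiply-accumulate ops (counted as 2 FLOPs each) for a dense
    # KxK convolution producing an HxW output feature map.
    return 2 * c_in * c_out * k * k * h * w

def tucker_conv_flops(c_in, c_out, k, h, w, r_in, r_out):
    # Tucker-2-format convolution as a sequence of three convolutions:
    #   1x1 reduce (c_in -> r_in), KxK core (r_in -> r_out),
    #   1x1 expand (r_out -> c_out), all on an HxW output map.
    return (conv_flops(c_in, r_in, 1, h, w)      # 1x1 reduce
            + conv_flops(r_in, r_out, k, h, w)   # KxK core
            + conv_flops(r_out, c_out, 1, h, w)) # 1x1 expand

# Illustrative ResNet-style layer: 256 -> 256 channels, 3x3, 14x14 map,
# with hypothetical Tucker ranks (64, 64).
dense = conv_flops(256, 256, 3, 14, 14)
tucker = tucker_conv_flops(256, 256, 3, 14, 14, 64, 64)
print(dense, tucker, round(dense / tucker, 2))
```

With these example numbers the Tucker-format layer needs roughly 8.5x fewer FLOPs, yet the three resulting convolutions (two of them skinny 1x1 GEMMs) map poorly onto generic GPU kernels, which is why the measured speedup with stock cuDNN lags far behind the FLOPs reduction and why the paper drives rank selection by measured inference time instead.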
