Paper title
An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers
Paper authors
Paper abstract
The Transformer has become an indispensable staple of deep learning. However, deploying efficient Transformers in real-life applications is very challenging because of the immense number of parameters and operations in these models. Exploiting sparsity is an effective way to relieve this burden and accelerate Transformers. Newly emerging Ampere GPUs leverage a 2:4 sparsity pattern to achieve model acceleration, but this single fixed pattern can hardly meet the diverse algorithm and hardware constraints that arise when deploying models. By contrast, we propose an algorithm-hardware co-optimized framework that flexibly and efficiently accelerates Transformers by utilizing general N:M sparsity patterns. (1) From the algorithm perspective, we propose a sparsity inheritance mechanism along with an inherited dynamic pruning (IDP) method to rapidly obtain a series of N:M sparse candidate Transformers. A model compression scheme is further proposed to significantly reduce the storage requirement for deployment. (2) From the hardware perspective, we present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers. STA features not only a computing engine that unifies sparse-dense and dense-dense matrix multiplications with high computational efficiency, but also a scalable softmax module that eliminates the latency of intermediate off-chip data communication. Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency. Moreover, STA can achieve 14.47x and 11.33x speedups over the Intel i9-9900X and NVIDIA RTX 2080 Ti, respectively, and performs inference 2.00-19.47x faster than state-of-the-art FPGA-based accelerators for Transformers.
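To make the N:M sparsity pattern concrete, the following is a minimal sketch and not the paper's IDP method or its compression scheme, neither of which is detailed in the abstract. It assumes the common reading of N:M sparsity, in which at most N of every M consecutive weights are nonzero, and it uses plain magnitude-based selection as a stand-in pruning criterion. The function names (`top_n_indices`, `prune_n_m`, `compress_n_m`, `decompress_n_m`) and the values-plus-indices storage layout are hypothetical, chosen only for illustration.

```python
def top_n_indices(group, n):
    """Return the sorted in-group positions of the n largest-magnitude values."""
    order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
    return sorted(order[:n])


def prune_n_m(row, n, m):
    """Zero all but the n largest-magnitude weights in every group of m.

    Assumption: magnitude-based selection, used here only as a stand-in
    for the paper's IDP method.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        keep = set(top_n_indices(group, n))
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def compress_n_m(row, n, m):
    """Store only the kept values and their in-group positions.

    Assumption: a generic values-plus-indices layout, not the paper's scheme.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    values, indices = [], []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        for i in top_n_indices(group, n):
            values.append(group[i])
            indices.append(i)
    return values, indices


def decompress_n_m(values, indices, n, m):
    """Rebuild the dense row from the compressed representation."""
    num_groups = len(values) // n
    row = [0.0] * (num_groups * m)
    for k, (value, index) in enumerate(zip(values, indices)):
        row[(k // n) * m + index] = value
    return row


weights = [0.1, -0.9, 0.3, 0.05, 0.7, -0.2, 0.0, 0.6]
pruned = prune_n_m(weights, n=2, m=4)
values, indices = compress_n_m(weights, n=2, m=4)
restored = decompress_n_m(values, indices, n=2, m=4)
```

With `n=2, m=4` this reproduces the 2:4 pattern mentioned for Ampere GPUs. Other choices such as 1:4 or 2:8 use the same code, which is the flexibility the abstract refers to. In this layout each kept value needs only a small in-group index (2 bits when M is 4), which suggests why a compressed form can reduce storage.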