Title
Vision Transformer Compression with Structured Pruning and Low Rank Approximation
Authors
Abstract
The Transformer architecture has gained popularity due to its ability to scale with large datasets. Consequently, there is a need to reduce model size and latency, especially for on-device deployment. We focus on the Vision Transformer proposed for image recognition tasks (Dosovitskiy et al., 2021) and explore the application of different compression techniques, such as low-rank approximation and pruning, for this purpose. Specifically, we investigate the structured pruning method recently proposed in Zhu et al. (2021) and find that it mostly prunes feedforward blocks, and does so with severe degradation in accuracy. To mitigate this, we propose a hybrid compression approach in which we compress the attention blocks using low-rank approximation and apply the aforementioned pruning at a lower rate to the feedforward blocks in each transformer layer. Our technique achieves 50% compression with a 14% relative increase in classification error, whereas pruning alone yields 44% compression with a 20% relative increase in error. We propose further enhancements to bridge the accuracy gap but leave them as future work.
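As a rough illustration of the low-rank approximation used for the attention blocks, the sketch below factorizes a single linear projection's weight matrix with a truncated SVD into two thinner linear layers. This is a minimal sketch under assumed settings (a 768-dimensional projection and rank 128 are illustrative choices), not the authors' exact compression procedure or rank-selection scheme.

```python
# Minimal sketch: low-rank approximation of a linear layer's weight via
# truncated SVD, as one might apply to the Q/K/V or output projections
# inside an attention block. Layer sizes and rank are illustrative.
import torch
import torch.nn as nn


def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace W (out x in) with the product of two thinner factors U_r @ V_r."""
    W = linear.weight.data            # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]      # (out_features, rank)
    V_r = Vh[:rank, :]                # (rank, in_features)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)


# Example: factorize a 768x768 projection down to rank 128, reducing its
# parameter count from 768*768 to 2*768*128.
proj = nn.Linear(768, 768)
compressed = low_rank_factorize(proj, rank=128)
```

The hybrid scheme in the abstract would apply such a factorization to attention projections while pruning the feedforward blocks at a reduced rate; the trade-off between rank and pruning ratio determines the overall compression.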