Paper Title

Dynamic N:M Fine-grained Structured Sparse Attention Mechanism

Paper Authors

Zhaodong Chen, Yuying Quan, Zheng Qu, Liu Liu, Yufei Ding, Yuan Xie

Paper Abstract

Transformers are becoming the mainstream solution for various tasks in NLP and computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. Tremendous efforts have been made to alleviate this problem, and many of them successfully reduce the asymptotic complexity to linear. Nevertheless, most of them fail to achieve practical speedup over the original full attention under moderate sequence lengths and are unfriendly to finetuning. In this paper, we present DFSS, an attention mechanism that dynamically prunes the full attention weight matrix to an N:M fine-grained structured sparse pattern. We provide both theoretical and empirical evidence demonstrating that DFSS is a good approximation of the full attention mechanism. We propose a dedicated CUDA kernel design that completely eliminates the dynamic pruning overhead and achieves speedup under arbitrary sequence lengths. We evaluate the 1:2 and 2:4 sparsity under different configurations and achieve 1.27x to 1.89x speedups over the full attention mechanism. It only takes a couple of finetuning epochs from the pretrained model to achieve on-par accuracy with the full attention mechanism on tasks from various domains under sequence lengths from 384 to 4096.
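
For intuition, below is a minimal PyTorch sketch of the idea described in the abstract: for every m consecutive pre-softmax attention scores along the key dimension, only the n largest are kept and the rest are excluded from the softmax. The function name nm_sparse_attention and the dense boolean mask are illustrative assumptions, not the paper's API; the paper's actual implementation fuses this selection into a dedicated CUDA kernel that emits the hardware N:M sparse format directly, so no dense mask is ever materialized.

import torch
import torch.nn.functional as F


def nm_sparse_attention(q, k, v, n=2, m=4):
    """Keep the n largest of every m consecutive attention scores along
    the key dimension, then softmax over the surviving entries only."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (..., L_q, L_k)

    *prefix, lq, lk = scores.shape
    assert lk % m == 0, "this sketch assumes L_k is a multiple of m"

    # Group scores into blocks of m along the key dimension and mark the
    # n largest entries of each block as kept.
    grouped = scores.reshape(*prefix, lq, lk // m, m)
    topk_idx = grouped.topk(n, dim=-1).indices
    keep = torch.zeros_like(grouped, dtype=torch.bool)
    keep.scatter_(-1, topk_idx, True)
    keep = keep.reshape(*prefix, lq, lk)

    # Pruned positions are excluded from the softmax (set to -inf).
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Example: batch of 2, 8 heads, sequence length 512, head dimension 64.
q, k, v = (torch.randn(2, 8, 512, 64) for _ in range(3))
out = nm_sparse_attention(q, k, v, n=2, m=4)  # same output shape as full attention

Masking to -inf before the softmax is equivalent to running the softmax over only the kept scores, which is the approximation the abstract claims stays close to full attention.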
