Paper Title
Hydra Attention: Efficient Attention with Many Heads
Paper Authors
Paper Abstract
While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which in turn scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.
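As a rough illustration of why using one attention head per feature can make the operation linear in both tokens and features, the following is a minimal PyTorch sketch. It assumes a decomposable cosine-similarity kernel in place of softmax (the abstract does not specify the kernel), and the function name hydra_attention and the toy shapes are illustrative only, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v):
    """Sketch of linear-time attention with one head per feature.

    q, k, v: (batch, tokens, features). When the number of heads equals the
    number of features, each head is a scalar, and with a decomposable
    (cosine-similarity) kernel the attention reduces to a global feature
    gate, costing O(tokens * features) instead of O(tokens^2 * features).
    This is an assumed formulation for illustration, not the paper's code.
    """
    q = F.normalize(q, dim=-1)              # L2-normalize each token's features
    k = F.normalize(k, dim=-1)
    kv = (k * v).sum(dim=1, keepdim=True)   # (batch, 1, features): global aggregate over tokens
    return q * kv                           # broadcast the aggregate as a per-feature gate

# Example with ViT-B/16-like shapes: 196 patch tokens, 768 features
x = torch.randn(1, 196, 768)
out = hydra_attention(x, x, x)
print(out.shape)  # torch.Size([1, 196, 768])
```

Because the token dimension is collapsed once into a single feature vector before gating the queries, no tokens-by-tokens attention matrix is ever formed, which is the source of the claimed linear scaling.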