Paper Title
Hydra Attention: Efficient Attention with Many Heads
Paper Authors
Paper Abstract
While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which in turn scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.
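As a rough illustration of why using one attention head per feature can make the operation linear in both tokens and features, the following is a minimal PyTorch sketch. It assumes a decomposable cosine-similarity kernel in place of softmax (the abstract does not specify the kernel), and the function name hydra_attention and the toy shapes are illustrative only, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v):
    """Sketch of linear-time attention with one head per feature.

    q, k, v: (batch, tokens, features). When the number of heads equals the
    number of features, each head is a scalar, and with a decomposable
    (cosine-similarity) kernel the attention reduces to a global feature
    gate, costing O(tokens * features) instead of O(tokens^2 * features).
    This is an assumed formulation for illustration, not the paper's code.
    """
    q = F.normalize(q, dim=-1)              # L2-normalize each token's features
    k = F.normalize(k, dim=-1)
    kv = (k * v).sum(dim=1, keepdim=True)   # (batch, 1, features): global aggregate over tokens
    return q * kv                           # broadcast the aggregate as a per-feature gate

# Example with ViT-B/16-like shapes: 196 patch tokens, 768 features
x = torch.randn(1, 196, 768)
out = hydra_attention(x, x, x)
print(out.shape)  # torch.Size([1, 196, 768])
```

Because the token dimension is collapsed once into a single feature vector before gating the queries, no tokens-by-tokens attention matrix is ever formed, which is the source of the claimed linear scaling.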