Paper Title

DaViT: Dual Attention Vision Transformers

Authors

Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

Abstract

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
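To make the spatial-token vs. channel-token distinction concrete, below is a minimal PyTorch sketch of the channel-group attention idea described in the abstract: each channel becomes a token whose feature vector is the set of spatial positions, and the channels are split into groups so the cost stays linear in the number of patches. The class name, grouping scheme, and scaling factor here are illustrative assumptions for exposition, not the authors' implementation; see the linked repository for the official code.

```python
# Minimal sketch of "channel tokens": attention is computed between channels,
# with each channel's features being the N spatial positions. Channels are
# split into groups, analogous to the grouping described in the abstract.
# All names and hyper-parameters are illustrative, not the official DaViT code.
import torch
import torch.nn as nn


class ChannelGroupAttention(nn.Module):
    def __init__(self, dim, groups=8):
        super().__init__()
        assert dim % groups == 0, "channel dim must be divisible by groups"
        self.groups = groups                     # number of channel groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, C), N = H*W patches
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # each (B, N, C)

        # Treat channels as tokens: (B, N, C) -> (B, groups, C//groups, N)
        def to_channel_tokens(t):
            return t.transpose(1, 2).reshape(B, self.groups, C // self.groups, N)

        q, k, v = map(to_channel_tokens, (q, k, v))

        # Attention scores between channels within each group: (B, g, C/g, C/g).
        # Scaling by sqrt(N) is an illustrative choice for this sketch.
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)
        out = attn.softmax(dim=-1) @ v           # (B, g, C/g, N)

        # Back to the usual (B, N, C) layout.
        out = out.reshape(B, C, N).transpose(1, 2)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)                  # 2 images, 14x14 patches, 64 channels
    y = ChannelGroupAttention(dim=64, groups=8)(x)
    print(y.shape)                               # torch.Size([2, 196, 64])
```

In this sketch every channel token already aggregates information from all spatial positions, which is why the abstract describes channel attention as inherently global; the complementary spatial attention (windowed attention over patch tokens, not shown here) supplies the fine-grained local interactions.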
