Paper Title
LightViT: Towards Light-Weight Convolution-Free Vision Transformers
Paper Authors
Paper Abstract
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to convolutions as plug-and-play modules and embed them in various ViT counterparts. In this paper, we argue that convolutional kernels perform information aggregation to connect all tokens; however, such explicit aggregation is actually unnecessary for light-weight ViTs if it can function in a more homogeneous way. Inspired by this, we present LightViT as a new family of light-weight ViTs that achieves a better accuracy-efficiency balance upon pure transformer blocks without convolution. Concretely, we introduce a global yet efficient aggregation scheme into both the self-attention and the feed-forward network (FFN) of ViTs, where additional learnable tokens are introduced to capture global dependencies, and bi-dimensional channel and spatial attentions are imposed over token embeddings. Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks. For example, our LightViT-T achieves 78.7% accuracy on ImageNet with only 0.7G FLOPs, outperforming PVTv2-B0 by 8.2% while being 11% faster on GPU. Code is available at https://github.com/hunto/LightViT.
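To make the aggregation idea concrete, below is a minimal PyTorch sketch of the learnable-global-token scheme the abstract describes: a small set of learnable tokens first gathers information from all image tokens, then broadcasts it back, giving every token a global receptive field without any convolution. This is an illustrative sketch, not the authors' implementation; the module and parameter names (GlobalTokenAttention, num_global_tokens) are hypothetical, and the official code is in the linked repository.

```python
# Sketch of global-token aggregation (assumption: PyTorch; names are
# hypothetical, not from the LightViT codebase).
import torch
import torch.nn as nn


class GlobalTokenAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, num_global_tokens: int = 8):
        super().__init__()
        # A small set of learnable tokens shared across all inputs.
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        nn.init.trunc_normal_(self.global_tokens, std=0.02)
        # Two cross-attention steps: aggregate (image tokens -> globals)
        # and broadcast (globals -> image tokens).
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim) image token embeddings
        b = x.shape[0]
        g = self.global_tokens.expand(b, -1, -1)
        # Global tokens query all image tokens to capture global context.
        g, _ = self.aggregate(query=g, key=x, value=x)
        # Each image token queries the updated global tokens, receiving
        # globally aggregated information at a cost linear in num_tokens.
        out, _ = self.broadcast(query=x, key=g, value=g)
        return x + out  # residual connection


if __name__ == "__main__":
    attn = GlobalTokenAttention(dim=64)
    tokens = torch.randn(2, 196, 64)  # e.g. a 14x14 token grid
    print(attn(tokens).shape)  # torch.Size([2, 196, 64])
```

Because the global tokens are few, both cross-attention steps scale linearly with the number of image tokens, which is what makes this kind of explicit global aggregation affordable in a light-weight model.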