Paper Title

MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition

Paper Authors

Wang, Yunhao, Sun, Huixin, Wang, Xiaodi, Zhang, Bin, Li, Chao, Xin, Ying, Zhang, Baochang, Ding, Errui, Han, Shumin

Paper Abstract

Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. However, conventional vision transformers often focus on global dependency at a coarse level, which makes it hard to learn both global relationships and fine-grained representations at the token level. In this paper, we introduce Multi-scale Attention Fusion into the transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at the token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) global feature extraction through a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. Our MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9$\%$ Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7$\%$ and 0.6$\%$ respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7$\%$ mAP on object detection and 1.4$\%$ on instance segmentation with similar-sized parameters, demonstrating its potential as a general backbone network.
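
The abstract describes a dual-stream block: a local window-attention branch, a global branch that attends over down-sampled features (GLD), and an attention-based fusion of the two. Below is a minimal PyTorch sketch of how such a block could be wired up. It is an illustrative reading of the abstract, not the authors' implementation: the window size, the average-pooling down-sampling, the per-token softmax fusion weights, and all module names are assumptions, and paper details such as relative position bias or the overall stage layout are omitted.

```python
# Minimal, assumed sketch of a dual-stream Multi-scale Attention Fusion (MAF) block.
# All design details here are illustrative, not the paper's reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalWindowAttention(nn.Module):
    """Fine-grained branch: self-attention inside non-overlapping windows."""
    def __init__(self, dim, num_heads=4, window_size=7):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        ws = self.window_size
        # partition into (ws x ws) windows and attend within each window
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)
        # merge windows back into the full feature map
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return x


class GlobalDownsampleAttention(nn.Module):
    """Coarse-grained branch (GLD-style, assumed): full-resolution queries
    attend to down-sampled keys/values to capture long-range context cheaply."""
    def __init__(self, dim, num_heads=4, pool_ratio=4):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        q = x.reshape(B, H * W, C)
        kv = self.pool(x.permute(0, 3, 1, 2))  # (B, C, H/r, W/r)
        kv = kv.flatten(2).transpose(1, 2)     # (B, HW/r^2, C)
        out, _ = self.attn(q, kv, kv)
        return out.view(B, H, W, C)


class MAFBlock(nn.Module):
    """Dual-stream block: local and global features fused by learned weights."""
    def __init__(self, dim, num_heads=4, window_size=7, pool_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.local_branch = LocalWindowAttention(dim, num_heads, window_size)
        self.global_branch = GlobalDownsampleAttention(dim, num_heads, pool_ratio)
        # per-token fusion weights over the two streams (an assumption about
        # how "integration of both features via attention" could be realized)
        self.fusion = nn.Linear(2 * dim, 2)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, H, W, C)
        shortcut = x
        x = self.norm(x)
        local_feat = self.local_branch(x)
        global_feat = self.global_branch(x)
        w = F.softmax(self.fusion(torch.cat([local_feat, global_feat], -1)), -1)
        fused = w[..., :1] * local_feat + w[..., 1:] * global_feat
        x = shortcut + fused
        return x + self.mlp(x)


if __name__ == "__main__":
    block = MAFBlock(dim=64)
    feat = torch.randn(2, 28, 28, 64)          # 28 divides by window 7 and pool 4
    print(block(feat).shape)                   # torch.Size([2, 28, 28, 64])
```

In this sketch the global branch keeps full-resolution queries but shrinks keys and values by the pooling ratio, which is one common way to make global attention affordable; the fusion step then mixes the two streams with per-token softmax weights rather than simple addition.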
