Understanding The Robustness in Vision Transformers

Paper Title

Understanding The Robustness in Vision Transformers

Authors

Zhou, Daquan, Yu, Zhiding, Xie, Enze, Xiao, Chaowei, Anandkumar, Anima, Feng, Jiashi, Alvarez, Jose M.

Abstract

Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code is available at: https://github.com/NVlabs/FAN.
