Paper Title

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Paper Authors

Zi-Chao Zhang, Zhen-Duo Chen, Yongxin Wang, Xin Luo, Xin-Shun Xu

Paper Abstract

Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC). These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC tasks. However, there are some limitations when applying ViT directly to FGVC. First, ViT needs to split images into patches and compute the attention between every pair of patches, which may result in heavy redundant computation and unsatisfactory performance when handling fine-grained images with complex backgrounds and small objects. Second, a standard ViT only utilizes the class token in the final layer for classification, which is not enough to extract comprehensive fine-grained information. To address these issues, we propose a novel ViT based fine-grained object discriminator for FGVC tasks, ViT-FOD for short. Specifically, besides a ViT backbone, it further introduces three novel components, i.e., Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI). Among them, APC pieces together informative patches from two images to generate a new image so that redundant computation can be reduced. CRF emphasizes tokens corresponding to discriminative regions to generate a new class token for subtle feature learning. To extract comprehensive information, CTI integrates the complementary information captured by class tokens in different ViT layers. We conduct comprehensive experiments on widely used datasets, and the results demonstrate that ViT-FOD achieves state-of-the-art performance.
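The abstract only describes APC, CRF, and CTI at a high level. Below is a minimal PyTorch sketch of one plausible reading of the three components; it is not the authors' implementation, and all shapes, the top-k values, the patch-stitching layout, and the layer-fusion choices are assumptions made purely for illustration.

```python
# Illustrative sketch only (assumptions throughout), not the official ViT-FOD code.
import torch
import torch.nn as nn


def attention_patch_combination(img_a, img_b, attn_a, attn_b, patch=16, top_k=98):
    """APC (as described in the abstract): stitch the most-attended patches of two
    images into one new image so less informative patches need not be re-processed.
    img_*: (3, H, W) images; attn_*: (N,) class-to-patch attention, N = (H*W)/patch**2."""
    c, h, w = img_a.shape
    cols = w // patch
    patches_a = img_a.unfold(1, patch, patch).unfold(2, patch, patch).reshape(c, -1, patch, patch)
    patches_b = img_b.unfold(1, patch, patch).unfold(2, patch, patch).reshape(c, -1, patch, patch)
    keep_a = attn_a.topk(top_k).indices                           # informative patches of image A
    keep_b = attn_b.topk(patches_a.shape[1] - top_k).indices      # fill the remainder from image B
    mixed = torch.cat([patches_a[:, keep_a], patches_b[:, keep_b]], dim=1)
    rows = mixed.shape[1] // cols
    # fold the selected patches back into an image grid (layout here is an assumption)
    mixed = mixed.reshape(c, rows, cols, patch, patch).permute(0, 1, 3, 2, 4)
    return mixed.reshape(c, rows * patch, cols * patch)


class CriticalRegionsFilter(nn.Module):
    """CRF (as described in the abstract): pool only the tokens of discriminative
    regions into an extra class-token-like vector for subtle feature learning."""
    def __init__(self, dim, top_k=24):
        super().__init__()
        self.top_k = top_k
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens, attn_scores):
        # patch_tokens: (B, N, D); attn_scores: (B, N) class-to-patch attention
        idx = attn_scores.topk(self.top_k, dim=1).indices          # (B, k)
        selected = torch.gather(
            patch_tokens, 1,
            idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1)))
        return self.proj(selected.mean(dim=1))                     # (B, D)


class ComplementaryTokensIntegration(nn.Module):
    """CTI (as described in the abstract): fuse class tokens taken from several ViT
    layers, since the last-layer class token alone is not comprehensive enough."""
    def __init__(self, dim, num_layers, num_classes):
        super().__init__()
        self.head = nn.Linear(dim * num_layers, num_classes)

    def forward(self, class_tokens):
        # class_tokens: list of (B, D) class tokens from the chosen layers
        return self.head(torch.cat(class_tokens, dim=-1))
```

In this reading, APC reuses the attention already computed by the backbone to decide which patches to keep, CRF turns the most-attended patch tokens into an additional classification vector, and CTI simply concatenates per-layer class tokens before a linear head; how the paper actually selects layers, patches, and combines losses is not specified in the abstract.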
