Paper Title

SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization

Authors

Hongbo Sun, Xiangteng He, Yuxin Peng

Abstract

Fine-grained visual categorization (FGVC) aims at recognizing objects from similar subordinate categories, which is challenging and addresses the practical need for accurate automatic recognition. Most FGVC approaches focus on attention mechanisms for mining discriminative regions, while neglecting the interdependencies among these regions and the holistic object structure they compose, both of which are essential for the model's ability to localize and understand discriminative information. To address these limitations, we propose the Structure Information Modeling Transformer (SIM-Trans), which incorporates object structure information into the transformer so that the learned discriminative representations contain both appearance and structure information. Specifically, we encode the image into a sequence of patch tokens and build a strong vision transformer framework with two well-designed modules: (i) the structure information learning (SIL) module mines the spatial context relations of significant patches within the object extent with the help of the transformer's self-attention weights; these relations are then injected into the model to import structure information; (ii) the multi-level feature boosting (MFB) module exploits the complementarity of multi-level features, together with contrastive learning among classes, to enhance feature robustness for accurate recognition. The two proposed modules are lightweight, can be plugged into any transformer backbone, and are easily trained end-to-end, relying only on the attention weights that the vision transformer itself provides. Extensive experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks. The code is available at https://github.com/PKU-ICST-MIPL/SIM-Trans_ACMMM2022.
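The abstract only sketches how SIL exploits self-attention weights to find significant patches. Below is a minimal illustrative sketch of that idea, assuming a standard ViT whose blocks expose per-head attention tensors; the function name `select_significant_patches` and the exact aggregation (averaging heads, accumulating the CLS-to-patch attention across layers) are our assumptions, not the paper's implementation.

```python
import torch

def select_significant_patches(attn_weights, k=16):
    """Score patch tokens by the CLS token's attention to them and keep the top-k.

    attn_weights: list of per-layer attention tensors, each of shape
                  (batch, heads, 1 + num_patches, 1 + num_patches),
                  as exposed by a standard ViT.
    Returns indices of the k highest-scoring patch tokens per image.
    """
    batch = attn_weights[0].shape[0]
    num_patches = attn_weights[0].shape[-1] - 1
    score = torch.zeros(batch, num_patches, device=attn_weights[0].device)
    for a in attn_weights:
        # Average over heads, then take the CLS row (attention from CLS
        # to every patch token), dropping the CLS column itself.
        score += a.mean(dim=1)[:, 0, 1:]
    return score.topk(k, dim=-1).indices  # (batch, k) patch indices
```

The selected indices can then be used to restrict structure modeling to patches inside the object extent, which is the role the abstract assigns to SIL.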
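Similarly, the MFB module is described as combining multi-level features with contrastive learning among classes. A minimal sketch of one plausible reading is below: fuse CLS embeddings from several layers and apply a standard supervised contrastive loss. The fusion by concatenation and this particular loss are our assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def multilevel_contrastive_loss(cls_tokens, labels, temperature=0.1):
    """Supervised contrastive loss over fused multi-level CLS features.

    cls_tokens: list of per-layer CLS embeddings, each (batch, dim),
                e.g. from the last few transformer blocks.
    labels:     (batch,) integer class labels.
    """
    z = F.normalize(torch.cat(cls_tokens, dim=-1), dim=-1)  # fuse levels
    sim = z @ z.t() / temperature                           # pairwise similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                       # positives, excluding self
    logits = sim.masked_fill(eye, float('-inf'))            # drop self-similarity
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    # Average log-likelihood of positives; rows without positives contribute 0.
    counts = pos.sum(dim=1).clamp(min=1)
    return -(log_prob.masked_fill(~pos, 0).sum(dim=1) / counts).mean()
```

This pulls same-class fused features together and pushes different classes apart, matching the abstract's stated goal of enhancing feature robustness for recognition.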
