Paper Title

ViT-P: Rethinking Data-efficient Vision Transformers from Locality

Paper Authors

Bin Chen, Ran Wang, Di Ming, Xin Feng

Abstract

Recent advances in Transformers have brought renewed interest to computer vision tasks. However, on small datasets, Transformers are hard to train and underperform convolutional neural networks. We make vision transformers as data-efficient as convolutional neural networks by introducing a multi-focal attention bias. Inspired by the attention distances observed in a well-trained ViT, we constrain the self-attention of ViT to have multi-scale localized receptive fields. The size of each receptive field is adaptable during training, so that the optimal configuration can be learned. We provide empirical evidence that a proper constraint on the receptive field reduces the amount of training data vision transformers require. On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch. We also perform analysis on ImageNet to show that our method does not lose accuracy on large datasets.
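To make the idea of a multi-focal locality bias concrete, below is a minimal, illustrative PyTorch sketch, not the authors' implementation: each attention head is restricted to a square window of patches, with a different window radius per head to give multiple receptive-field scales. The function names, the per-head radius list, and the use of fixed (rather than learned/adaptable) radii are assumptions for illustration, and the class token is ignored.

```python
import torch

def local_attention_mask(grid_size: int, radius: int) -> torch.Tensor:
    """Additive mask (0 inside the window, -inf outside) for a grid_size x grid_size patch grid."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2)                                    # (N, 2) patch coordinates
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)   # Chebyshev distance between patches
    mask = torch.zeros(dist.shape)
    mask[dist > radius] = float("-inf")                               # block attention outside the window
    return mask                                                       # (N, N)

def multi_focal_attention(q, k, v, grid_size, radii):
    """q, k, v: (batch, heads, N, dim); radii: one window radius per head (multi-scale locality)."""
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale                        # (B, H, N, N) attention logits
    bias = torch.stack([local_attention_mask(grid_size, r) for r in radii])  # (H, N, N) per-head bias
    attn = (logits + bias).softmax(dim=-1)
    return attn @ v

# Usage: a 14x14 patch grid and 4 heads with increasingly large receptive fields (radii are illustrative).
B, H, N, D = 2, 4, 14 * 14, 64
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))
out = multi_focal_attention(q, k, v, grid_size=14, radii=[1, 2, 4, 7])
print(out.shape)  # torch.Size([2, 4, 196, 64])
```

In the paper the receptive-field size is learned during training rather than fixed; one way to read the sketch is that the hard -inf mask would be replaced by a bias whose extent is parameterized and optimized with the rest of the network.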
