Title

Exploiting Category Names for Few-Shot Classification with Vision-Language Models

Authors

Taihong Xiao, Zirui Wang, Liangliang Cao, Jiahui Yu, Shengyang Dai, Ming-Hsuan Yang

Abstract

Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that can map two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks like image classification. However, when there are only a few examples per category, the potential of large vision-language models is often underperformed, mainly due to the gap between a large number of parameters and a relatively small amount of training data. This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head. With the proposed category name initialization method, our model obtains the state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning).
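The core idea in the abstract — initializing the classification head with the text embeddings of the category names — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_text_encoder` is a hypothetical stand-in for a real pretrained text encoder (e.g. the text tower of a CLIP-style model), and all names below are assumptions for the sketch.

```python
import numpy as np

def toy_text_encoder(class_names, dim=512):
    """Hypothetical stand-in for a vision-language model's text encoder:
    returns a deterministic pseudo-embedding per class name (within one run)."""
    embs = []
    for name in class_names:
        rng = np.random.default_rng(abs(hash(name)) % (2**31))
        embs.append(rng.standard_normal(dim))
    return np.stack(embs)

def init_head_from_category_names(class_names, dim=512):
    """Build a classification-head weight matrix whose rows are the
    L2-normalized text embeddings of the category names, so that logits
    against a normalized image feature are cosine similarities."""
    W = toy_text_encoder(class_names, dim)
    return W / np.linalg.norm(W, axis=1, keepdims=True)

class_names = ["dog", "car", "bird"]
W = init_head_from_category_names(class_names)   # shape (3, 512)

# An image feature that matches a class-name embedding scores highest
# for that class; here we reuse the "dog" embedding as a mock feature.
image_feat = toy_text_encoder(["dog"])[0]
image_feat = image_feat / np.linalg.norm(image_feat)
logits = W @ image_feat
print(int(np.argmax(logits)))  # 0, i.e. "dog"
```

In few-shot training, this head would then be fine-tuned on the handful of labeled examples per class, starting from the category-name initialization instead of a random one.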
