Paper Title

Unified Vision and Language Prompt Learning

Paper Authors

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy

Paper Abstract

Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than the unimodal counterparts on few-shot learning benchmarks, as well as on domain generalization benchmarks. Code and models will be released to facilitate future research.
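
The core mechanism the abstract describes, a tiny network that produces prompts for both modalities from a shared set of learnable parameters, can be illustrated with a minimal PyTorch sketch. Everything below (the class name `UnifiedPromptSketch`, the dimensions, and the choice of a single transformer encoder layer as the "tiny neural network") is an illustrative assumption, not the paper's exact architecture; refer to the authors' released code for the real implementation.

```python
import torch
import torch.nn as nn


class UnifiedPromptSketch(nn.Module):
    """Hypothetical sketch: shared learnable prompt tokens are transformed
    by a tiny joint network, then split into text-side and vision-side
    prompts. All dimensions here are illustrative, not the paper's."""

    def __init__(self, n_prompts=4, dim=512, text_dim=512, visual_dim=768):
        super().__init__()
        # Shared prompts: the first half will become text prompts,
        # the second half visual prompts.
        self.shared_prompts = nn.Parameter(torch.randn(2 * n_prompts, dim) * 0.02)
        # The "tiny neural network" that jointly transforms prompts across
        # modalities; a single self-attention layer is assumed here.
        self.joint_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=dim)
        # Per-modality projections to each frozen encoder's token width.
        self.to_text = nn.Linear(dim, text_dim)
        self.to_visual = nn.Linear(dim, visual_dim)
        self.n_prompts = n_prompts

    def forward(self):
        # TransformerEncoderLayer expects (seq_len, batch, dim) by default.
        p = self.joint_layer(self.shared_prompts.unsqueeze(1)).squeeze(1)
        text_prompts = self.to_text(p[:self.n_prompts])      # prepended to class-name tokens
        visual_prompts = self.to_visual(p[self.n_prompts:])  # prepended to image patch tokens
        return text_prompts, visual_prompts


# Usage: only the prompter's parameters receive gradients during training.
prompter = UnifiedPromptSketch()
text_p, visual_p = prompter()
print(text_p.shape, visual_p.shape)  # torch.Size([4, 512]) torch.Size([4, 768])
```

Under this reading, the approach is parameter-efficient because only the shared prompts and the tiny joint network are tuned, while the large pre-trained text and image encoders stay frozen, consistent with the abstract's description of tuning "only a small number of parameters in a model's input space."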
