Paper Title

Understanding and Improving Visual Prompting: A Label-Mapping Perspective

Paper Authors

Aochuan Chen, Yuguang Yao, Pin-Yu Chen, Yihua Zhang, Sijia Liu

Paper Abstract

We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (in terms of input perturbation patterns) into downstream data points. Yet, it remains elusive why VP stays effective even given a ruleless label mapping (LM) between the source classes and the target classes. Inspired by the above, we ask: How is LM interrelated with VP? And how to exploit such a relationship to improve its accuracy on target tasks? We peer into the influence of LM on VP and provide an affirmative answer that a better 'quality' of LM (assessed by mapping precision and explanation) can consistently improve the effectiveness of VP. This is in contrast to the prior art where the factor of LM was missing. To optimize LM, we propose a new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps the source labels to the target labels and progressively improves the target task accuracy of VP. Further, when using a contrastive language-image pretrained (CLIP) model, we propose to integrate an LM process to assist the text prompt selection of CLIP and to improve the target task accuracy. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VP methods. As highlighted below, we show that when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets. Besides, our proposal on CLIP-based VP provides 13.7% and 7.1% accuracy improvements on Flowers102 and DTD respectively. Our code is available at https://github.com/OPTML-Group/ILM-VP.
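
The abstract describes the core loop of ILM-VP: alternating prompt training with label re-mapping. The following minimal PyTorch sketch illustrates that idea, assuming a frozen ImageNet-pretrained ResNet-18, an additive trainable prompt, and a simple frequency-based greedy one-to-one re-mapping refreshed once per epoch. All names (`remap_labels`, `train_ilm_vp`) and hyperparameters are illustrative and are not taken from the official ILM-VP repository, whose actual implementation details (prompt padding, data normalization, mapping criterion) may differ.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen ImageNet-pretrained source model: its weights are never updated.
source_model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).to(device).eval()
for p in source_model.parameters():
    p.requires_grad_(False)

NUM_SOURCE_CLASSES, NUM_TARGET_CLASSES = 1000, 102  # e.g., ImageNet -> Flowers102

# Universal visual prompt: one trainable input perturbation added to every image.
# (Resizing, padding, and normalization of the target data are omitted for brevity.)
prompt = torch.zeros(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([prompt], lr=0.01)


def remap_labels(loader):
    """Greedy frequency-based label mapping: pair each target class with the
    source class it most often activates under the current prompt (one-to-one)."""
    freq = torch.zeros(NUM_TARGET_CLASSES, NUM_SOURCE_CLASSES, device=device)
    with torch.no_grad():
        for x, y in loader:
            probs = F.softmax(source_model(x.to(device) + prompt), dim=1)
            for t in range(NUM_TARGET_CLASSES):
                mask = y.to(device) == t
                if mask.any():
                    freq[t] += probs[mask].sum(dim=0)
    mapping = torch.full((NUM_TARGET_CLASSES,), -1, dtype=torch.long, device=device)
    scores = freq.clone()
    for _ in range(NUM_TARGET_CLASSES):
        t, s = divmod(torch.argmax(scores).item(), NUM_SOURCE_CLASSES)
        mapping[t] = s
        scores[t, :] = float("-inf")  # target class t is now assigned
        scores[:, s] = float("-inf")  # source class s cannot be reused
    return mapping


def train_ilm_vp(loader, epochs=20):
    for _ in range(epochs):
        mapping = remap_labels(loader)  # iterative LM: refresh the mapping each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = source_model(x + prompt)              # source-class logits
            loss = F.cross_entropy(logits[:, mapping], y)  # read off the mapped logits
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

In this sketch only the prompt tensor is optimized; re-running `remap_labels` before each epoch is what distinguishes iterative label mapping from the fixed (e.g., random or one-shot) mappings used in prior VP methods.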
