Paper Title


Prompt Generation Networks for Input-Space Adaptation of Frozen Vision Transformers

Paper Authors

Jochem Loedeman, Maarten C. Stol, Tengda Han, Yuki M. Asano

Paper Abstract


With the introduction of the transformer architecture in computer vision, increasing model scale has been demonstrated as a clear path to achieving performance and robustness gains. However, with model parameter counts reaching the billions, classical finetuning approaches are becoming increasingly limiting and even infeasible when models are hosted as inference APIs, as is already common in NLP. Visual input-prompt learning, an adaptation technique in which additional inputs in visual (RGB) space are learned, has emerged as a potential solution for adapting frozen and cloud-hosted models, requiring neither access to the forward pass nor post-processing. Yet so far, these constraints have significantly degraded adaptation performance. To this end, we propose the Prompt Generation Network (PGN), which generates a different prompt for every data point; the prompt is then used to adapt a frozen pretrained vision model to a target task. We show that the PGN effectively adapts pretrained models to various new datasets: it surpasses previous methods by a large margin on 12/12 datasets and even outperforms full finetuning on 5/12, while requiring 100x fewer parameters. Lastly, we introduce the "prompt inversion" trick, with which PGNs can be efficiently trained in a latent space but deployed in RGB input space for inference.
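The core idea above, a lightweight network that produces input-dependent prompt tokens for a frozen model, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, the linear feature extractor standing in for a small backbone, and the choice of combining prompts as soft-attention weights over a shared token library are all simplifying assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not taken from the paper):
L, D, K = 16, 32, 4            # library size, token dim, prompts per image

token_library = rng.normal(size=(L, D))         # learned prompt-token library
W = rng.normal(size=(3 * 8 * 8, L * K)) * 0.01  # toy "PGN" linear layer

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def generate_prompts(image):
    """Map one image to K input-dependent prompt tokens.

    The generator predicts, per prompt slot, soft weights over a shared
    token library; each prompt is the weighted sum of library tokens.
    """
    feats = image.reshape(-1)                # stand-in for a small CNN
    logits = feats @ W                       # shape (L*K,)
    weights = softmax(logits.reshape(K, L))  # (K, L) weights per slot
    return weights @ token_library           # (K, D) prompt tokens

image = rng.normal(size=(3, 8, 8))           # toy RGB input
prompts = generate_prompts(image)
print(prompts.shape)                         # prints (4, 32)
```

In an actual setup, the K generated tokens would be prepended to the frozen vision transformer's patch-token sequence, and only the generator and token library would receive gradients.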
