Paper Title
Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Paper Authors
Paper Abstract
Pre-trained multi-modal models such as CLIP provide transferable embeddings and show promising results in diverse applications. However, the learned multi-modal embeddings remain relatively unexplored, and their transferability can be improved. In this work, we observe that CLIP holds separated embedding subspaces for its two modalities, and we investigate this through the lens of uniformity and alignment to measure the quality of the learned representation. Both theoretically and empirically, we show that CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack of alignment and uniformity might restrict the transferability and robustness of the embeddings. To this end, we devise a new fine-tuning method that yields robust representations with better alignment and uniformity. First, we propose Geodesic Multi-Modal Mixup, which mixes image and text embeddings to generate hard negative samples on the hypersphere. Then, we fine-tune the model on these hard negatives, together with the original negatives and positives, using a contrastive loss. We justify the method with a theoretical analysis of its hardness guarantee and limiting behavior. Extensive experiments on retrieval, calibration, few- or zero-shot classification (under distribution shift), embedding arithmetic, and image captioning further show that our method provides transferable representations, enabling robust model adaptation on diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup
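The mixing step described in the abstract can be illustrated with a small sketch. It assumes that "geodesic" mixing means spherical linear interpolation (slerp) between unit-normalized image and text embeddings, so that the mixed point stays on the hypersphere. The function names `l2_normalize` and `geodesic_mixup`, the weighting convention for `lam`, and the use of NumPy are illustrative assumptions and are not taken from the authors' repository.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row of x onto the unit hypersphere."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def geodesic_mixup(img_emb, txt_emb, lam):
    """Hypothetical sketch: slerp between paired image and text embeddings.

    Assumed convention: lam is the weight on the image embedding, so
    lam=1 returns the image embedding and lam=0 returns the text embedding.
    """
    a = l2_normalize(img_emb)
    b = l2_normalize(txt_emb)

    # Angle between each image/text pair, clipped for numerical safety.
    cos = np.clip(np.sum(a * b, axis=-1, keepdims=True), -1.0, 1.0)
    theta = np.arccos(cos)
    sin_theta = np.sin(theta)

    # When the pair is (anti)parallel, sin(theta) is ~0 and slerp is
    # ill-defined; fall back to the image embedding in that case.
    degenerate = sin_theta < 1e-6
    safe_sin = np.where(degenerate, 1.0, sin_theta)

    w_img = np.sin(lam * theta) / safe_sin
    w_txt = np.sin((1.0 - lam) * theta) / safe_sin
    mixed = np.where(degenerate, a, w_img * a + w_txt * b)

    # Re-normalize to remove floating-point drift off the sphere.
    return l2_normalize(mixed)
```

Unlike plain linear mixup, which pulls the mixed vector inside the sphere, this interpolation keeps unit norm. The mixed embeddings could then be appended as extra negatives to the similarity matrix of a contrastive loss; how the authors do this exactly is not specified in the abstract.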