Paper Title
Data Determines Distributional Robustness in Contrastive Language-Image Pre-training (CLIP)
Paper Authors
Paper Abstract
Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.
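For readers unfamiliar with factor (v), the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in PyTorch. It is illustrative only and not taken from the paper: the function name and the fixed temperature value are assumptions, and the actual CLIP implementation learns the temperature and trains with very large batches.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings (sketch)."""
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_features @ text_features.t() / temperature

    # Matching image-caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```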