Paper Title
Training Vision-Language Models with Less Bimodal Supervision
Paper Authors
Paper Abstract
Standard practice in pretraining multimodal models, such as vision-language models, is to rely on pairs of aligned inputs from both modalities, for example, aligned image-text pairs. However, such pairs can be difficult to obtain in low-resource settings and for some modality pairs (e.g., structured tables and images). In this work, we investigate the extent to which we can reduce the reliance on such parallel data, which we term \emph{bimodal supervision}, and use models that are pretrained on each modality independently. We experiment with a high-performing vision-language model, and analyze the effect of bimodal supervision on three vision-language tasks. We find that on simpler tasks, such as VQAv2 and GQA, one can eliminate bimodal supervision completely, suffering only a minor loss in performance. Conversely, for NLVR2, which requires more complex reasoning, training without bimodal supervision leads to random performance. Nevertheless, using only 5\% of the bimodal data (142K images along with their captions), or leveraging weak supervision in the form of a list of machine-generated labels for each image, leads to only a moderate degradation compared to using 3M image-text pairs: 74\%$\rightarrow$$\sim$70\%. Our code is available at https://github.com/eladsegal/less-bimodal-sup.
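As a rough illustration of the setting described in the abstract, the sketch below fuses a vision encoder and a text encoder that were each pretrained unimodally, with only a small fusion head trained on the downstream task. The specific model choices (ViT, BERT) and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture or training recipe.

```python
# Minimal sketch (assumption): combine unimodally pretrained encoders for a
# VQA-style classification task without any image-text pretraining.
# ViT / BERT and the concatenation fusion are illustrative choices only.
import torch
import torch.nn as nn
from transformers import ViTModel, BertModel


class UnimodalFusionClassifier(nn.Module):
    def __init__(self, num_answers: int):
        super().__init__()
        # Encoders pretrained independently on images and on text.
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.vision.config.hidden_size + self.text.config.hidden_size
        # Only this fusion head is randomly initialized and task-specific.
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # Pooled representations from each unimodal encoder.
        img = self.vision(pixel_values=pixel_values).pooler_output
        txt = self.text(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        # Late fusion by concatenation, followed by answer classification.
        return self.classifier(torch.cat([img, txt], dim=-1))
```

In this sketch, any bimodal supervision (e.g., a small set of image-text pairs or machine-generated image labels) would be used as an additional alignment step before downstream fine-tuning; the paper itself analyzes how much of that data is actually needed.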