Paper Title


Leveraging per Image-Token Consistency for Vision-Language Pre-training

Paper Authors

Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang

Abstract


Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable amount of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked token but it cannot simultaneously leverage other tokens to learn vision-language associations. To handle those limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then the model is required to determine for each token in the sentence whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with pre-training methods. Extensive experiments show that the combination of the EPIC method and state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. The code is released at https://github.com/gyhdog99/epic.
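The abstract describes a three-step corruption pipeline: mask the tokens most salient to the image, replace them with alternatives sampled from a language model, and train the model to label every token as image-consistent or not. A minimal sketch of that corruption step is below; the helper names (`epic_corrupt`, `toy_lm`) and the toy saliency scores are illustrative assumptions, not the paper's implementation, and a real system would obtain saliency from image-text attention and replacements from a pretrained masked language model.

```python
import random

def epic_corrupt(tokens, saliency, lm_sample, mask_ratio=0.15, seed=0):
    """Sketch of EPIC-style corruption (hypothetical helper, not the paper's code).

    1) Saliency-based Masking: select the tokens most salient to the image.
    2) Inconsistent Token Generation: replace them with alternatives
       sampled from a language model.
    Returns the corrupted sentence and per-token consistency labels
    (1 = consistent with the image, 0 = inconsistent), which serve as
    targets for the Image-Token Consistency task.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    # Indices of the most image-salient tokens (highest saliency first).
    masked = sorted(range(len(tokens)), key=lambda i: -saliency[i])[:n_mask]
    corrupted, labels = list(tokens), [1] * len(tokens)
    for i in masked:
        alt = lm_sample(tokens, i, rng)  # sample a replacement from an LM
        if alt != tokens[i]:             # only a changed token is inconsistent
            corrupted[i], labels[i] = alt, 0
    return corrupted, labels

def toy_lm(tokens, i, rng):
    """Stand-in for a masked language model's sampled prediction."""
    return rng.choice(["dog", "cat", "car", "tree"])

tokens = ["a", "man", "rides", "a", "horse"]
saliency = [0.01, 0.8, 0.5, 0.01, 0.9]  # e.g. derived from image-text attention
corrupted, labels = epic_corrupt(tokens, saliency, toy_lm, mask_ratio=0.4)
```

With `mask_ratio=0.4`, the two most salient tokens ("horse" and "man") are replaced, and their labels flip to 0 while the rest stay 1; the model then predicts these labels for every token, which is how EPIC uses unmasked tokens as supervision as well.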
