Paper Title

Vision-Language Pre-Training with Triple Contrastive Learning

Paper Authors

Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, Junzhou Huang

Paper Abstract

Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to its capability in maximizing the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores data potential within each modality, which may result in degraded representations. For instance, although CMA-based models are able to map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can get even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieves the new state of the art on various common down-stream vision-language tasks such as image-text retrieval and visual question answering.
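
The abstract describes three contrastive terms: cross-modal alignment (CMA) between image-text pairs, an intra-modal contrastive objective within each modality, and a local-MI term between local regions and their global summary. As a rough illustration of how the first two terms compose, here is a minimal PyTorch sketch. All names (`info_nce`, `triple_contrastive_loss`, the augmented-view inputs) and the temperature value are our own hypothetical choices, the second view of each modality is assumed to come from augmentation or a momentum encoder, and the local-MI term is omitted; this is not the authors' released implementation.

```python
# Hypothetical sketch of the cross-modal and intra-modal contrastive terms
# described in the abstract; not the authors' released TCL implementation.
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    """Standard InfoNCE: the matched (query, key) pair at each batch index
    is the positive; all other keys in the batch serve as negatives."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)

def triple_contrastive_loss(img_emb, txt_emb, img_emb_aug, txt_emb_aug):
    """Cross-modal alignment (CMA) plus two intra-modal contrastive terms.
    The augmented embeddings standing in for a second view of each modality
    are an assumption on our part."""
    cma = 0.5 * (info_nce(img_emb, txt_emb) + info_nce(txt_emb, img_emb))
    imc_img = info_nce(img_emb, img_emb_aug)   # image <-> augmented image
    imc_txt = info_nce(txt_emb, txt_emb_aug)   # text  <-> augmented text
    return cma + imc_img + imc_txt

# Toy usage with random features standing in for encoder outputs.
B, D = 8, 256
loss = triple_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                               torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```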
