Paper Title


X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

Paper Authors

Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou

Paper Abstract


Vision language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments. Some others utilize pre-trained object detectors to leverage vision language alignments at the object level. In this paper, we propose to learn multi-grained vision language alignments by a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on it, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experiment results show that X$^2$-VLM performs the best on base and large scale for both image-text and video-text tasks, making a good trade-off between performance and model scale. Moreover, we show that the modular design of X$^2$-VLM results in high transferability for it to be utilized in any language or domain. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
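The abstract highlights a modular architecture in which the text encoder is a plug-in component, so it can be replaced (e.g., with XLM-R) to obtain a multilingual model without multilingual pre-training. The sketch below illustrates that idea only; it is not the authors' implementation (see the linked repository for that), and the module names, the stand-in vision encoder, and the single cross-attention fusion step are illustrative assumptions.

```python
# A minimal sketch of a modular vision-language model with a swappable text
# encoder, assuming a HuggingFace-style interface. NOT the X^2-VLM codebase.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class ModularVLM(nn.Module):
    """Illustrative model: vision encoder + swappable text encoder + fusion."""

    def __init__(self, text_encoder_name: str = "bert-base-uncased",
                 vision_dim: int = 768, hidden_dim: int = 768):
        super().__init__()
        # Stand-in vision encoder: any module that maps images to patch features.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, vision_dim, kernel_size=32, stride=32),  # 224x224 -> 7x7 patches
            nn.Flatten(2),
        )
        # Swappable text encoder: pass "xlm-roberta-base" for a multilingual variant.
        self.text_encoder = AutoModel.from_pretrained(text_encoder_name)
        text_dim = self.text_encoder.config.hidden_size
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        # Fusion module: text tokens attend to visual tokens via cross-attention.
        self.fusion = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)

    def forward(self, images: torch.Tensor, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # Visual tokens: (batch, num_patches, vision_dim)
        vis = self.vision_encoder(images).transpose(1, 2)
        vis = self.vision_proj(vis)
        # Text tokens: (batch, seq_len, text_dim)
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        txt = self.text_proj(txt)
        # Cross-modal fusion: text queries attend over visual keys/values.
        fused, _ = self.fusion(query=txt, key=vis, value=vis)
        return fused


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = ModularVLM("bert-base-uncased")
    batch = tok(["a dog catching a frisbee"], return_tensors="pt", padding=True)
    images = torch.randn(1, 3, 224, 224)
    out = model(images, batch["input_ids"], batch["attention_mask"])
    print(out.shape)  # (1, seq_len, 768)
```

Because only `text_encoder_name` changes when switching to XLM-R, the rest of the model (vision encoder, fusion module, pre-trained cross-modal weights) stays intact, which is the transferability property the abstract describes.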
