Paper Title
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
Paper Authors
Paper Abstract
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model. Starting from the pre-trained multimodal representation model CLIP released by OpenAI, we replaced its text encoder with a pre-trained multilingual text encoder, XLM-R, and aligned both language and image representations through a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations on a wide range of tasks. We set new state-of-the-art performance on a range of tasks including ImageNet-CN, Flickr30k-CN, COCO-CN and XTD. Furthermore, we obtain performance very close to CLIP's on almost all tasks, suggesting that one can simply replace the text encoder in CLIP to gain extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.
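The two training objectives described in the abstract can be sketched in a minimal, self-contained form. This is an illustrative numpy sketch, not the paper's implementation: stage one (teacher learning) pushes the new text encoder's sentence embeddings toward the frozen CLIP text encoder's embeddings on parallel text (MSE is one simple choice of distance), and stage two (contrastive learning) uses the symmetric InfoNCE loss from CLIP on image-text pairs. All function names and the choice of MSE here are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize embeddings to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def teacher_loss(student_text, teacher_text):
    # Stage 1 (teacher learning): train the replacement text encoder
    # (e.g. XLM-R plus a projection) so its embeddings match the frozen
    # CLIP teacher's embeddings on parallel text. MSE is an illustrative
    # choice of distance between the two embedding sets.
    return np.mean((student_text - teacher_text) ** 2)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Stage 2 (contrastive learning): symmetric InfoNCE over a batch of
    # image-text pairs, as in CLIP. Matching pairs lie on the diagonal
    # of the cosine-similarity matrix.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def xent_diag(l):
        # cross-entropy per row with the diagonal entry as the target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(logp[np.arange(n), np.arange(n)])

    # average of image-to-text and text-to-image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

In practice the teacher stage needs only text data (parallel sentences), while the contrastive stage needs paired image-text data; keeping the image encoder frozen during teacher learning is what lets the new text encoder inherit CLIP's image-text alignment.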