Paper Title
Vision-and-Language Pretrained Models: A Survey
Paper Authors
Paper Abstract
Pretrained models have achieved great success in both Computer Vision (CV) and Natural Language Processing (NLP). This progress has led to Visual-Language Pretrained Models (VLPMs), which learn joint representations of vision and language by feeding visual and linguistic content into a multi-layer Transformer. In this paper, we present an overview of the major advances achieved in VLPMs for producing joint representations of vision and language. As preliminaries, we briefly describe the general task definition and the generic architecture of VLPMs. We first discuss language and vision data encoding methods, and then present the mainstream VLPM structures as the core content. We further summarise several essential pretraining and fine-tuning strategies. Finally, we highlight three future directions that offer insightful guidance to both CV and NLP researchers.
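The "feeding visual and linguistic content into a multi-layer Transformer" idea can be illustrated with a minimal sketch. This is not the paper's method, only a toy single-stream example under made-up assumptions: visual region features and word embeddings (both invented here, already projected to a shared hidden size) are concatenated into one sequence, and a single unlearned self-attention step lets every position attend over both modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared hidden size (arbitrary for this sketch)

# Hypothetical inputs: 3 "image region" features and 5 "word" embeddings,
# assumed to be already projected into the same d-dimensional space.
visual = rng.normal(size=(3, d))
text = rng.normal(size=(5, d))

# Single-stream input: concatenate both modalities into one sequence.
x = np.concatenate([visual, text], axis=0)  # shape (8, d)

def self_attention(x):
    """Minimal single-head self-attention with no learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])          # (8, 8) similarity
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # rows sum to 1
    return weights @ x                               # mix both modalities

joint = self_attention(x)
print(joint.shape)  # (8, 16): every output mixes visual and text positions
```

A real VLPM stacks many such layers with learned query/key/value projections, positional and segment embeddings, and modality-specific encoders before the fusion; the sketch only shows why concatenation yields cross-modal attention for free.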