Paper Title

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Paper Authors

Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang

Paper Abstract

This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. The survey is inspired by the remarkable progress in both computer vision and natural language processing, and by the recent trend shifting from single-modality processing to multi-modality comprehension. We summarize the development of this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce the development of task-specific methods. Then we focus on VLP methods and comprehensively review the key components of their model structures and training methods. After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero- or few-shot learning tasks. Finally, we discuss some potential future trends towards modality cooperation, unified representation, and knowledge incorporation. We believe this review will be of help to researchers and practitioners in AI and ML, especially those interested in computer vision and natural language processing.
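The third period described in the abstract centers on learning language-aligned visual representations from raw image-text pairs, an approach popularized by CLIP-style contrastive pre-training. As a rough illustration only, below is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE) objective commonly used for this purpose; the function name `contrastive_loss` and the temperature value 0.07 are illustrative assumptions, not details taken from this paper.

```python
# A minimal sketch of a CLIP-style symmetric contrastive objective.
# Assumes image/text embeddings come from two separate encoders (not shown).
import torch
import torch.nn.functional as F


def contrastive_loss(image_features: torch.Tensor,
                     text_features: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of N aligned image-text pairs.

    image_features, text_features: (N, D) embeddings. The i-th image and
    i-th text form a positive pair; all other in-batch pairings act as
    negatives.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy batch: 8 pairs of 512-d embeddings standing in for encoder outputs.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_loss(img, txt).item())
```

Because the resulting visual and textual embeddings live in a shared space, class names embedded by the text encoder can be compared directly against image embeddings at inference time, which is what enables the zero- and few-shot transfer discussed in the abstract.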
