Paper Title

MiniVLM: A Smaller and Faster Vision-Language Model

Authors

Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

Abstract

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great practical value but less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be fine-tuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules: a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by $95\%$ compared to a baseline model. We adopt the MiniLM structure to reduce the computation cost of the transformer module after comparing different compact BERT models. In addition, we improve the MiniVLM pre-training by adding $7M$ Open Images data, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. These large models are used offline only and add no overhead to fine-tuning or inference. With the above design choices, our MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$ while retaining $94-97\%$ of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of state-of-the-art VL research for on-the-edge applications.
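The abstract describes a two-module layout: a lightweight region-feature extractor (TEE) feeding a compact MiniLM-style transformer that fuses text tokens and visual features in a single stream. The sketch below illustrates that data flow only; it is not the authors' implementation. Every class name and dimension here (TwoStageEfficientExtractor, MiniVLMSketch, feat_dim=384, num_regions=50) is a hypothetical placeholder standing in for the real EfficientDet-inspired detector and the MiniLM fusion module (MiniLM's 12-layer, 384-hidden, 12-head configuration is the only detail taken from published model sizes).

```python
import torch
import torch.nn as nn

# Minimal sketch of the two-module MiniVLM layout from the abstract.
# NOT the authors' code: all names and shapes are illustrative.

class TwoStageEfficientExtractor(nn.Module):
    """Stand-in for TEE: maps an image to a set of region features."""
    def __init__(self, num_regions=50, feat_dim=384):
        super().__init__()
        # Placeholder backbone; the paper's TEE is an EfficientDet-inspired
        # two-stage detector, not this toy pooling stack.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((num_regions, 1)),
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, images):                   # (B, 3, H, W)
        x = self.backbone(images)                # (B, 32, R, 1)
        x = x.squeeze(-1).transpose(1, 2)        # (B, R, 32)
        return self.proj(x)                      # (B, R, feat_dim)

class MiniVLMSketch(nn.Module):
    """TEE region features + token embeddings -> compact fusion transformer."""
    def __init__(self, vocab_size=30522, feat_dim=384, layers=12, heads=12):
        super().__init__()
        self.extractor = TwoStageEfficientExtractor(feat_dim=feat_dim)
        self.token_emb = nn.Embedding(vocab_size, feat_dim)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads,
            dim_feedforward=4 * feat_dim, batch_first=True)
        # MiniLM-sized fusion: small hidden width, standard depth.
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, images, token_ids):
        v = self.extractor(images)               # (B, R, D)
        t = self.token_emb(token_ids)            # (B, T, D)
        # Single sequence over text and regions, single-stream fusion.
        return self.fusion(torch.cat([t, v], dim=1))  # (B, T+R, D)

# Example forward pass with dummy inputs:
# model = MiniVLMSketch()
# out = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
# out.shape -> torch.Size([2, 66, 384])
```

Concatenating token embeddings with region features into one sequence reflects the single-stream design the abstract implies; the claimed savings come from shrinking both halves, the detector ($95\%$ of extraction time) and the transformer, rather than from changing this overall layout.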
