Paper Title

MiniVLM: A Smaller and Faster Vision-Language Model

Authors

Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

Abstract

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great practical value but less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be fine-tuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules: a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by $95\%$ compared to a baseline model. We adopt the MiniLM structure to reduce the computation cost of the transformer module after comparing different compact BERT models. In addition, we improve the MiniVLM pre-training by adding $7M$ Open Images data, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. These large models are used offline only and add no overhead to fine-tuning or inference. With the above design choices, our MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$ while retaining $94-97\%$ of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of state-of-the-art VL research for on-the-edge applications.
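The abstract describes a two-module layout: a lightweight region-feature extractor (TEE) feeding a compact MiniLM-style transformer that fuses text tokens and visual features in a single stream. The sketch below illustrates that data flow only; it is not the authors' implementation. Every class name and dimension here (TwoStageEfficientExtractor, MiniVLMSketch, feat_dim=384, num_regions=50) is a hypothetical placeholder standing in for the real EfficientDet-inspired detector and the MiniLM fusion module (MiniLM's 12-layer, 384-hidden, 12-head configuration is the only detail taken from published model sizes).

```python
import torch
import torch.nn as nn

# Minimal sketch of the two-module MiniVLM layout from the abstract.
# NOT the authors' code: all names and shapes are illustrative.

class TwoStageEfficientExtractor(nn.Module):
    """Stand-in for TEE: maps an image to a set of region features."""
    def __init__(self, num_regions=50, feat_dim=384):
        super().__init__()
        # Placeholder backbone; the paper's TEE is an EfficientDet-inspired
        # two-stage detector, not this toy pooling stack.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((num_regions, 1)),
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, images):                   # (B, 3, H, W)
        x = self.backbone(images)                # (B, 32, R, 1)
        x = x.squeeze(-1).transpose(1, 2)        # (B, R, 32)
        return self.proj(x)                      # (B, R, feat_dim)

class MiniVLMSketch(nn.Module):
    """TEE region features + token embeddings -> compact fusion transformer."""
    def __init__(self, vocab_size=30522, feat_dim=384, layers=12, heads=12):
        super().__init__()
        self.extractor = TwoStageEfficientExtractor(feat_dim=feat_dim)
        self.token_emb = nn.Embedding(vocab_size, feat_dim)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads,
            dim_feedforward=4 * feat_dim, batch_first=True)
        # MiniLM-sized fusion: small hidden width, standard depth.
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, images, token_ids):
        v = self.extractor(images)               # (B, R, D)
        t = self.token_emb(token_ids)            # (B, T, D)
        # Single sequence over text and regions, single-stream fusion.
        return self.fusion(torch.cat([t, v], dim=1))  # (B, T+R, D)

# Example forward pass with dummy inputs:
# model = MiniVLMSketch()
# out = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
# out.shape -> torch.Size([2, 66, 384])
```

Concatenating token embeddings with region features into one sequence reflects the single-stream design the abstract implies; the claimed savings come from shrinking both halves, the detector ($95\%$ of extraction time) and the transformer, rather than from changing this overall layout.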
