Paper Title
Generative Negative Text Replay for Continual Vision-Language Pretraining
Paper Authors
Paper Abstract
Vision-language pre-training (VLP) has attracted increasing attention recently. With a large number of image-text pairs, VLP models trained with contrastive loss have achieved impressive performance in various tasks, especially zero-shot generalization on downstream datasets. In practical applications, however, massive data are usually collected in a streaming fashion, requiring VLP models to continuously integrate novel knowledge from incoming data while retaining learned knowledge. In this work, we focus on learning a VLP model with sequential chunks of image-text pair data. To tackle the catastrophic forgetting issue in this multi-modal continual learning setting, we first introduce pseudo text replay, which generates hard negative texts conditioned on the training images in memory; this not only better preserves learned knowledge but also improves the diversity of negative samples in the contrastive loss. Moreover, we propose multi-modal knowledge distillation between images and texts to align the instance-wise predictions of the old and new models. We incrementally pre-train our model on both instance- and class-incremental splits of the Conceptual Captions dataset, and evaluate it on zero-shot image classification and image-text retrieval tasks. Our method consistently outperforms existing baselines by a large margin, demonstrating its superiority. Notably, we observe an average performance boost of $4.60\%$ on image-classification downstream datasets for the class-incremental split.
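To make the two objectives in the abstract concrete, the sketch below illustrates (a) an image-to-text contrastive loss whose candidate set is enlarged with generated hard negative texts, and (b) a distillation term aligning the old and new models' instance-wise similarity distributions. This is a minimal NumPy illustration under assumed inputs (precomputed embedding matrices and logits); the function names, temperature values, and loss shapes are our assumptions, not the paper's implementation.

```python
import numpy as np

def _logsumexp(x, axis=-1, keepdims=False):
    # Numerically stable log-sum-exp.
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def _log_softmax(x, axis=-1):
    return x - _logsumexp(x, axis=axis, keepdims=True)

def _l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss_with_hard_negatives(img_emb, txt_emb, neg_txt_emb, tau=0.07):
    # Image-to-text InfoNCE: each image's candidate set is the in-batch
    # texts plus generated hard negative texts (pseudo text replay).
    img = _l2_normalize(img_emb)                                 # (B, D)
    cand = _l2_normalize(np.concatenate([txt_emb, neg_txt_emb]))  # (B + M, D)
    logits = img @ cand.T / tau                                   # (B, B + M)
    log_prob = _log_softmax(logits, axis=1)
    b = img.shape[0]
    # The matching text for image i sits at index i (texts come first).
    return -np.mean(log_prob[np.arange(b), np.arange(b)])

def multimodal_distillation_loss(new_logits, old_logits, temperature=2.0):
    # KL(old || new) over each instance's image-text similarity distribution,
    # pulling the new model's predictions toward the frozen old model's.
    log_p_old = _log_softmax(old_logits / temperature, axis=1)
    log_p_new = _log_softmax(new_logits / temperature, axis=1)
    p_old = np.exp(log_p_old)
    return np.mean(np.sum(p_old * (log_p_old - log_p_new), axis=1))
```

In training, the two terms would be summed (possibly with a weighting coefficient) so that the contrastive loss learns from the current chunk while distillation preserves the old model's behavior.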