Paper Title
Dynamic Contrastive Distillation for Image-Text Retrieval
Paper Authors
Paper Abstract
Although vision-and-language pretraining (VLP) equipped cross-modal image-text retrieval (ITR) has achieved remarkable progress in the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios, where high latency is unacceptable. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face the following two challenges: 1) the typical uni-modal metric learning approach is difficult to apply directly to cross-modal tasks, because limited GPU memory makes it hard to optimize a large number of negative samples while handling cross-modal fusion features; 2) statically optimizing the student network on hard samples is inefficient, since different hard samples have different effects on distillation learning and student network optimization. We overcome these challenges from two directions. First, to achieve multi-modal contrastive learning while balancing training cost and effectiveness, we propose using the teacher network to estimate which samples are difficult for the student, so that the student both absorbs the powerful knowledge of the pre-trained teacher and masters the knowledge carried by hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation, which adaptively learns samples of different difficulties, better balancing the difficulty of the knowledge against the student's self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129$\times$ compared to existing ITR models.
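To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (1) teacher-guided hard-negative mining and (2) a difficulty-weighted (dynamic) distillation term. The function name `dcd_loss`, the top-k mining rule, and the KL-gap-based weighting are illustrative assumptions for exposition, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def dcd_loss(student_scores, teacher_scores, temperature=0.05, top_k=8):
    """Sketch of dynamic contrastive distillation (assumed formulation).

    student_scores, teacher_scores: [B, B] image-text similarity matrices
    whose diagonals hold the matched (positive) pairs.
    """
    B = student_scores.size(0)
    pos_mask = torch.eye(B, dtype=torch.bool, device=student_scores.device)

    # 1) Teacher-guided hard-negative mining: for each query, keep the
    #    top-k negatives that the *teacher* scores as most similar,
    #    instead of contrasting against all B-1 negatives (saves memory
    #    when scores come from expensive cross-modal fusion).
    neg_teacher = teacher_scores.masked_fill(pos_mask, float("-inf"))
    hard_idx = neg_teacher.topk(top_k, dim=1).indices              # [B, k]

    pos_s = student_scores.diagonal().unsqueeze(1)                 # [B, 1]
    hard_s = student_scores.gather(1, hard_idx)                    # [B, k]
    logits_s = torch.cat([pos_s, hard_s], dim=1) / temperature
    targets = torch.zeros(B, dtype=torch.long, device=logits_s.device)
    contrastive = F.cross_entropy(logits_s, targets)

    # 2) Dynamic distillation: build teacher logits over the same
    #    candidates, then weight each sample's KL term by how far the
    #    student is from the teacher (its current "difficulty"), so
    #    harder samples contribute more to the distillation signal.
    pos_t = teacher_scores.diagonal().unsqueeze(1)
    hard_t = teacher_scores.gather(1, hard_idx)
    logits_t = torch.cat([pos_t, hard_t], dim=1) / temperature
    p_t = F.softmax(logits_t, dim=1)
    log_p_s = F.log_softmax(logits_s, dim=1)
    kd_per_sample = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)
    difficulty = kd_per_sample.detach()
    weights = difficulty / (difficulty.mean() + 1e-8)              # [B]
    distill = (weights * kd_per_sample).mean()

    return contrastive + distill
```

In this sketch the weights are normalized to mean 1, so the dynamic term reduces to plain distillation when all samples are equally hard; any monotone mapping from difficulty to weight would serve the same balancing role described in the abstract.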