Paper Title
Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model
Paper Authors
Paper Abstract
Recently, vision-language pre-training has shown great potential in open-vocabulary object detection, where detectors trained on base classes are devised to detect novel classes. Class text embeddings are first generated by feeding prompts into the text encoder of a pre-trained vision-language model; they are then used as a region classifier to supervise detector training. The key ingredient behind this model's success is a proper prompt, which requires careful word tuning and ingenious design. To avoid laborious prompt engineering, several prompt representation learning methods have been proposed for the image classification task, but they yield only sub-optimal solutions when applied to the detection task. In this paper, we introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection based on a pre-trained vision-language model. Unlike previous classification-oriented methods, DetPro has two highlights: 1) a background interpretation scheme that includes proposals from the image background in prompt training; 2) a context grading scheme that separates proposals in the image foreground for tailored prompt training. We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector, and conduct experiments on LVIS as well as transfer-learning experiments on the Pascal VOC, COCO, and Objects365 datasets. Experimental results show that DetPro outperforms the baseline ViLD in all settings, e.g., +3.4 APbox and +3.0 APmask on the novel classes of LVIS. Code and models are available at https://github.com/dyabel/detpro.
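The pipeline in the first half of the abstract, where hand-crafted prompts are embedded and the resulting text embeddings act as a region classifier, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: it assumes OpenAI's `clip` package, and `region_feats` is a hypothetical stand-in for the RoI-aligned region embeddings a detector would produce.

```python
# Minimal sketch of the prompt-based region classifier described above.
# Assumes OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "zebra"]  # base + novel classes (illustrative)
# Hand-crafted template; DetPro replaces it with learned context vectors.
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(prompts)                      # (C, D)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # unit-normalize

# Hypothetical stand-in for per-proposal region embeddings from the detector.
region_feats = torch.randn(5, text_emb.shape[-1], device=device)
region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)

# Scaled cosine similarity serves as the region classification logits.
logits = 100.0 * region_feats.float() @ text_emb.float().T    # (5, C)
pred = logits.argmax(dim=-1)                                  # class per proposal
```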
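DetPro's central idea, replacing the hand-written template with learnable continuous context vectors (in the spirit of CoOp), can likewise be sketched. The module below is a simplified assumption of how such a prompt might be parameterized; in the actual method the concatenated sequence is passed through the frozen CLIP text transformer to produce class embeddings, and the context is optimized using the background interpretation and context grading schemes summarized above.

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Sketch of a CoOp/DetPro-style learnable prompt.

    `n_ctx` learnable context vectors are shared across all classes and
    concatenated in front of each (frozen) class-name token embedding.
    The resulting sequences would then be fed through the frozen CLIP
    text encoder to obtain class embeddings for the region classifier.
    """

    def __init__(self, n_ctx: int, name_embs: torch.Tensor):
        super().__init__()
        # name_embs: (num_classes, name_len, dim), frozen token embeddings
        # of the class names, e.g. looked up in CLIP's embedding table.
        self.register_buffer("name_embs", name_embs)
        ctx_dim = name_embs.shape[-1]
        # Learnable context with small random init, as in CoOp.
        self.ctx = nn.Parameter(torch.empty(n_ctx, ctx_dim).normal_(std=0.02))

    def forward(self) -> torch.Tensor:
        num_classes = self.name_embs.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(num_classes, -1, -1)
        # [learned context tokens][class-name tokens] for each class.
        return torch.cat([ctx, self.name_embs], dim=1)

# Usage: 8 context tokens for 3 classes whose names take 2 tokens each
# in a 512-dim embedding space (all numbers are illustrative).
prompt = ContinuousPrompt(n_ctx=8, name_embs=torch.randn(3, 2, 512))
sequences = prompt()  # (3, 10, 512), differentiable w.r.t. prompt.ctx
```

During training only `prompt.ctx` would receive gradients; per the two schemes in the abstract, background proposals and the graded foreground proposals each contribute their own supervision to this shared context.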