Paper Title
Learn to Understand Negation in Video Retrieval
Paper Authors
Paper Abstract
Negation is a common linguistic skill that allows humans to express what we do NOT want. Naturally, one might expect video retrieval to support natural-language queries with negation, e.g., finding shots of kids sitting on the floor and not playing with a dog. However, state-of-the-art deep-learning-based video retrieval models lack such an ability, as they are typically trained on video description datasets such as MSR-VTT and VATEX that lack negated descriptions. Their retrieved results largely ignore the negator in the sample query, incorrectly returning videos showing kids playing with a dog. This paper presents the first study on learning to understand negation in video retrieval and makes the following contributions. By re-purposing two existing datasets (MSR-VTT and VATEX), we propose a new evaluation protocol for video retrieval with negation. We propose a learning-based method for training a negation-aware video retrieval model. The key idea is to first construct a soft negative caption for a specific training video by partially negating its original caption, and then compute a bidirectionally constrained loss on the resulting triplet. This auxiliary loss is added, with a weight, to a standard retrieval loss. Experiments on the re-purposed benchmarks show that re-training the CLIP (Contrastive Language-Image Pre-Training) model with the proposed method clearly improves its ability to handle queries with negation. In addition, the model performance on the original benchmarks is also improved.
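To make the key idea more concrete, below is a minimal PyTorch-style sketch of how an auxiliary loss over a (video, original caption, soft-negative caption) triplet could be weighted into a standard retrieval loss. The margin-based auxiliary term, the symmetric InfoNCE retrieval loss, and all hyper-parameter values here are illustrative assumptions, not the paper's exact bidirectionally constrained formulation.

```python
import torch
import torch.nn.functional as F

def negation_aware_loss(v, c_pos, c_neg, margin=0.2, weight=0.1):
    """Hedged sketch of the auxiliary term: push a video closer to its original
    caption than to its partially negated (soft negative) caption by a margin.
    The hinge form and the margin/weight values are assumptions for illustration."""
    sim_pos = F.cosine_similarity(v, c_pos, dim=-1)  # video vs. original caption
    sim_neg = F.cosine_similarity(v, c_neg, dim=-1)  # video vs. soft-negative caption
    return weight * torch.clamp(margin - sim_pos + sim_neg, min=0).mean()

def training_loss(v, c_pos, c_neg, temperature=0.07, aux_weight=0.1):
    """Weighted sum of a standard CLIP-style symmetric InfoNCE retrieval loss and
    the auxiliary negation-aware term above."""
    v, c_pos, c_neg = (F.normalize(x, dim=-1) for x in (v, c_pos, c_neg))
    logits = v @ c_pos.t() / temperature                      # batch video-text similarities
    targets = torch.arange(v.size(0), device=v.device)        # matched pairs on the diagonal
    retrieval_loss = 0.5 * (F.cross_entropy(logits, targets) +
                            F.cross_entropy(logits.t(), targets))
    return retrieval_loss + negation_aware_loss(v, c_pos, c_neg, weight=aux_weight)

# Toy usage: random embeddings standing in for CLIP video/text features.
v, c_pos, c_neg = (torch.randn(8, 512) for _ in range(3))
loss = training_loss(v, c_pos, c_neg)
```

In this sketch, the negated caption is treated purely as a "soft" negative for its own video; how the negated captions are generated and how the bidirectional constraint is defined follow the paper itself.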