Paper Title
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Paper Authors
Paper Abstract
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but detailed local semantics are ignored. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to address this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover text-aligned features of the masked patches by reasoning over the visible regions along the spatial and temporal dimensions, which enhances the discriminativeness of local visual features and the fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols. Our approach also significantly surpasses the baseline models on zero-shot action recognition, which can be cast as video-to-text retrieval.
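The following is a minimal, hypothetical sketch of the masked-visual-modeling objective the abstract describes: a snapshot video encoder, updated as an evolving "tokenizer", provides per-patch reconstruction targets, and the trainable video encoder must recover those targets at masked positions from the visible patches. All function names, the EMA-style snapshot update, the masking ratio, and the regression loss are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; module names and hyper-parameters are assumptions.
import torch
import torch.nn.functional as F


def update_snapshot(snapshot_encoder, video_encoder, momentum=0.999):
    """Assumed EMA update so the snapshot 'tokenizer' evolves with training."""
    with torch.no_grad():
        for p_s, p_v in zip(snapshot_encoder.parameters(), video_encoder.parameters()):
            p_s.data.mul_(momentum).add_(p_v.data, alpha=1.0 - momentum)


def masked_visual_modeling_loss(video_encoder, snapshot_encoder, patches, mask_ratio=0.5):
    """patches: (B, N, D) flattened spatio-temporal video patch embeddings."""
    B, N, _ = patches.shape
    num_masked = int(N * mask_ratio)

    # Randomly select patches to mask for each video in the batch.
    noise = torch.rand(B, N, device=patches.device)
    masked_idx = noise.argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(B, N, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, masked_idx, True)

    # Targets: per-patch features from the snapshot encoder on the full video.
    with torch.no_grad():
        target = F.normalize(snapshot_encoder(patches), dim=-1)  # (B, N, D)

    # Prediction: the video encoder sees the corrupted video (masked positions
    # handled internally, e.g. via a learnable mask token) and must recover
    # the snapshot features at those positions.
    pred = F.normalize(video_encoder(patches, mask=mask), dim=-1)  # (B, N, D)

    # Regress targets only at the masked positions.
    loss = (pred[mask] - target[mask]).pow(2).sum(dim=-1).mean()
    return loss
```

In this reading, regressing text-aligned snapshot features (rather than raw pixels) is what injects language semantics into the masked-patch targets, while the EMA update keeps those targets improving as pre-training progresses.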