Title
M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
Authors
Abstract
Videos contain multi-modal content, and exploring multi-level cross-modal interactions with natural language queries can greatly benefit the text-video retrieval (TVR) task. However, recent methods that apply the large-scale pre-trained model CLIP to TVR do not focus on multi-modal cues in videos. Furthermore, traditional methods that simply concatenate multi-modal features fail to exploit fine-grained cross-modal information in videos. In this paper, we propose a multi-level multi-modal hybrid fusion (M2HF) network to explore comprehensive interactions between text queries and the content of each modality in videos. Specifically, M2HF first early-fuses visual features extracted by CLIP with audio and motion features extracted from the videos, obtaining audio-visual fusion features and motion-visual fusion features, respectively. The multi-modal alignment problem is also addressed in this process. Then, the visual features, audio-visual fusion features, motion-visual fusion features, and texts extracted from videos establish cross-modal relationships with caption queries in a multi-level manner. Finally, the retrieval outputs from all levels are late-fused to obtain the final text-video retrieval results. Our framework supports two training strategies: an ensemble manner and an end-to-end manner. Moreover, a novel multi-modal balance loss function is proposed to balance the contribution of each modality for efficient end-to-end training. M2HF achieves state-of-the-art results on various benchmarks, e.g., Rank@1 of 64.9\%, 68.2\%, 33.2\%, 57.1\%, and 57.8\% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
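To make the late-fusion step concrete, the sketch below ranks videos for one text query by averaging cosine-similarity scores across per-level representations (e.g. visual, audio-visual, motion-visual). This is a minimal illustrative sketch, not the paper's implementation: the function and level names are hypothetical, equal-weight averaging is an assumed fusion rule, and real features would come from CLIP and the fusion modules rather than hand-written vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def late_fuse_retrieve(query_feat, level_video_feats):
    """Rank videos for one text query by late-fusing per-level similarities.

    level_video_feats maps a level name (e.g. 'visual', 'audio-visual',
    'motion-visual') to one feature vector per video. Equal-weight
    averaging is an assumed fusion rule for illustration only; the
    paper's exact late-fusion scheme may differ.
    """
    n_videos = len(next(iter(level_video_feats.values())))
    fused = [0.0] * n_videos
    for feats in level_video_feats.values():
        for i, video_feat in enumerate(feats):
            fused[i] += cosine(query_feat, video_feat)
    fused = [s / len(level_video_feats) for s in fused]
    # Return video indices sorted by fused similarity, best match first.
    return sorted(range(n_videos), key=lambda i: -fused[i])
```

For example, a query vector close to the first video's features at every level will rank that video first even if a single level is noisy, which is the intuition behind fusing retrieval outputs from all levels.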