Paper Title

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

Authors

Damianos Galanopoulos, Vasileios Mezaris

Abstract

In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations, our proposed network architecture is trained by following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
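The two key ideas in the abstract (fusing similarities from multiple joint text-video feature spaces, and revising the query-video similarities with additional softmax operations at retrieval time) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of joint spaces, the embedding dimension, the temperature value, and the random tensors standing in for learned projections are all hypothetical, and the revision shown is one plausible reading of the abstract, in the form of a common "dual softmax" weighting over queries.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: T textual and V visual feature types give T x V joint
# spaces; each space embeds queries and videos so cosine similarity applies.
torch.manual_seed(0)
num_queries, num_videos, dim = 4, 10, 256
num_spaces = 6  # e.g. T=2 text features x V=3 visual features (assumption)

# One (queries x videos) similarity matrix per joint space. Random tensors
# stand in for the learned per-space projections of the real model.
sims_per_space = []
for _ in range(num_spaces):
    q = F.normalize(torch.randn(num_queries, dim), dim=-1)
    v = F.normalize(torch.randn(num_videos, dim), dim=-1)
    sims_per_space.append(q @ v.T)  # cosine similarities per space

# Fuse the joint spaces by summing their similarity matrices.
sim = torch.stack(sims_per_space).sum(dim=0)

# Retrieval-stage revision (assumption): weight each query-video similarity
# by a softmax over the query axis, down-weighting videos that score highly
# for many queries at once -- one common form of dual-softmax revision.
temperature = 100.0  # hypothetical value
revised = sim * F.softmax(temperature * sim, dim=0)

# Rank videos per query by the revised similarity.
ranking = revised.argsort(dim=1, descending=True)
print(ranking[0])  # video indices for the first query, best match first
```

Summing the per-space similarity matrices is just one simple fusion choice; the point of the sketch is that each text-visual feature pair contributes its own comparable score, and the softmax step reshapes those scores only at retrieval time, without retraining.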
