Paper Title

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

Authors

Damianos Galanopoulos, Vasileios Mezaris

Abstract

In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations, our proposed network architecture is trained by following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
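The two key ideas in the abstract (fusing similarities from multiple joint text-video feature spaces, and revising the query-video similarities with additional softmax operations at retrieval time) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of joint spaces, the embedding dimension, the temperature value, and the random tensors standing in for learned projections are all hypothetical, and the revision shown is one plausible reading of the abstract, in the form of a common "dual softmax" weighting over queries.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: T textual and V visual feature types give T x V joint
# spaces; each space embeds queries and videos so cosine similarity applies.
torch.manual_seed(0)
num_queries, num_videos, dim = 4, 10, 256
num_spaces = 6  # e.g. T=2 text features x V=3 visual features (assumption)

# One (queries x videos) similarity matrix per joint space. Random tensors
# stand in for the learned per-space projections of the real model.
sims_per_space = []
for _ in range(num_spaces):
    q = F.normalize(torch.randn(num_queries, dim), dim=-1)
    v = F.normalize(torch.randn(num_videos, dim), dim=-1)
    sims_per_space.append(q @ v.T)  # cosine similarities per space

# Fuse the joint spaces by summing their similarity matrices.
sim = torch.stack(sims_per_space).sum(dim=0)

# Retrieval-stage revision (assumption): weight each query-video similarity
# by a softmax over the query axis, down-weighting videos that score highly
# for many queries at once -- one common form of dual-softmax revision.
temperature = 100.0  # hypothetical value
revised = sim * F.softmax(temperature * sim, dim=0)

# Rank videos per query by the revised similarity.
ranking = revised.argsort(dim=1, descending=True)
print(ranking[0])  # video indices for the first query, best match first
```

Summing the per-space similarity matrices is just one simple fusion choice; the point of the sketch is that each text-visual feature pair contributes its own comparable score, and the softmax step reshapes those scores only at retrieval time, without retraining.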
