Paper Title

A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

Paper Authors

Bowen Zhang, Hexiang Hu, Joonseok Lee, Ming Zhao, Sheide Chammas, Vihan Jain, Eugene Ie, Fei Sha

Paper Abstract

Identifying a short segment in a long video that semantically matches a text query is a challenging task that has important application potentials in language-based video search, browsing, and navigation. Typical retrieval systems respond to a query with either a whole video or a pre-defined video segment, but it is challenging to localize undefined segments in untrimmed and unsegmented videos where exhaustively searching over all possible segments is intractable. The outstanding challenge is that the representation of a video must account for different levels of granularity in the temporal domain. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal localization, and masked language modeling. We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets. Our approach outperforms the previous methods as well as strong baselines, establishing new state-of-the-art for this task.
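
The abstract describes encoding each video at two temporal granularities: a fine-grained frame level and a coarse-grained clip level. Below is a minimal PyTorch sketch of that two-level idea, not the authors' implementation: a frame-level transformer contextualizes frames within each clip, and a clip-level transformer runs over pooled per-clip vectors. All names and choices here (HierarchicalVideoEncoder, mean pooling, 512-d features, layer counts) are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of hierarchical two-level video
# encoding: frame-level within each clip, clip-level across the video.
import torch
import torch.nn as nn


class HierarchicalVideoEncoder(nn.Module):
    """Hypothetical two-granularity encoder over pre-extracted frame features."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
        super().__init__()
        # Fine-grained encoder: contextualizes frames inside a single clip.
        self.frame_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                       batch_first=True),
            num_layers=n_layers)
        # Coarse-grained encoder: contextualizes pooled clip vectors
        # across the whole video.
        self.clip_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                       batch_first=True),
            num_layers=n_layers)

    def forward(self, frames):
        # frames: (batch, n_clips, frames_per_clip, feat_dim)
        b, c, f, d = frames.shape
        # Fold clips into the batch dim so each clip is encoded independently.
        frame_feats = self.frame_encoder(frames.reshape(b * c, f, d))
        frame_feats = frame_feats.reshape(b, c, f, d)
        # Mean-pool frames into one vector per clip, then encode the clip sequence.
        clip_feats = self.clip_encoder(frame_feats.mean(dim=2))
        return frame_feats, clip_feats  # fine- and coarse-grained features


# Toy usage: 2 videos, 6 clips each, 16 frames per clip, 512-d features.
enc = HierarchicalVideoEncoder()
frame_feats, clip_feats = enc(torch.randn(2, 6, 16, 512))
print(frame_feats.shape, clip_feats.shape)  # (2, 6, 16, 512) and (2, 6, 512)
```

Per the abstract, the paper trains such representations jointly on multiple subtasks (video retrieval, segment temporal localization, and masked language modeling); those losses and the text-query encoder are omitted from this sketch.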
