Paper Title
Local-Global Video-Text Interactions for Temporal Grounding
Paper Authors
Paper Abstract
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video that is semantically relevant to a text query. We tackle this problem with a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to important semantic entities described in the query (e.g., actors, objects, and actions), and to model bi-modal interactions between the linguistic features of the query and the visual features of the video at multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global scales during the bi-modal interactions. Through in-depth ablation studies, we find that incorporating both local and global context into the video-text interactions is crucial for accurate grounding. Our experiments show that the proposed method outperforms the state of the art on the Charades-STA and ActivityNet Captions datasets by large margins of 7.44% and 4.61% points in Recall@tIoU=0.5, respectively. Code is available at https://github.com/JonghwanMun/LGI4temporalgrounding.
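To make the pipeline described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch, not the authors' released implementation (see the repository linked above for that). The class name LocalGlobalGroundingSketch, the feature dimensions, and the specific fusion operations are assumptions made only for this example: phrase-level query features interact with video segment features locally (per segment) and globally (attention-pooled over the whole clip), and the fused representation is regressed to a normalized (start, end) interval, which is then scored with temporal IoU as in the Recall@tIoU=0.5 metric.

```python
# Hypothetical sketch of a regression-based local-global grounding model.
# All module names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGlobalGroundingSketch(nn.Module):
    def __init__(self, video_dim=1024, phrase_dim=512, hidden_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.phrase_proj = nn.Linear(phrase_dim, hidden_dim)
        # query-conditioned temporal attention over video segments
        self.attn = nn.Linear(hidden_dim, 1)
        # regression head predicting a normalized (start, end) pair in [0, 1]
        self.regressor = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, video_feats, phrase_feats):
        # video_feats:  (B, T, video_dim)  -- T video segments
        # phrase_feats: (B, P, phrase_dim) -- P semantic phrases from the query
        v = self.video_proj(video_feats)                      # (B, T, H)
        p = self.phrase_proj(phrase_feats).mean(dim=1)        # (B, H) query summary

        # local interaction: modulate each segment by the query representation
        local = v * p.unsqueeze(1)                            # (B, T, H)

        # global interaction: attention pooling over all segments
        scores = self.attn(torch.tanh(local)).squeeze(-1)     # (B, T)
        weights = F.softmax(scores, dim=1)                    # (B, T)
        global_ctx = (weights.unsqueeze(-1) * v).sum(dim=1)   # (B, H)

        fused = torch.cat([p, global_ctx], dim=-1)            # (B, 2H)
        out = torch.sigmoid(self.regressor(fused))            # (B, 2)
        return torch.sort(out, dim=-1).values                 # ensure start <= end


def tiou(pred, gt):
    """Temporal IoU between predicted and ground-truth (start, end) intervals."""
    inter = (torch.min(pred[:, 1], gt[:, 1]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (gt[:, 1] - gt[:, 0]) - inter
    return inter / union.clamp(min=1e-8)


if __name__ == "__main__":
    model = LocalGlobalGroundingSketch()
    video = torch.randn(2, 128, 1024)   # 2 clips, 128 segments each
    phrases = torch.randn(2, 3, 512)    # 3 extracted phrases per query
    pred = model(video, phrases)
    gt = torch.tensor([[0.2, 0.5], [0.1, 0.9]])
    # Recall@tIoU=0.5 counts predictions whose overlap with ground truth is >= 0.5
    print((tiou(pred, gt) >= 0.5).float().mean().item())
```

The sketch keeps only the high-level structure named in the abstract (phrase extraction as input, local and global bi-modal interaction, interval regression); the actual model uses richer sequential query attention and multi-level interaction modules than this single fusion step.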