Paper Title

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

Paper Authors

Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong

Paper Abstract

In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal annotation during training. We propose a two-stage model to tackle this problem in a coarse-to-fine manner. In the coarse stage, we first generate a set of fixed-length temporal proposals using multi-scale sliding windows, and match their visual features against the sentence features to identify the best-matched proposal as a coarse grounding result. In the fine stage, we perform a fine-grained matching between the visual features of the frames in the best-matched proposal and the sentence features to locate the precise frame boundary of the fine grounding result. Comprehensive experiments on the ActivityNet Captions dataset and the Charades-STA dataset demonstrate that our two-stage model achieves compelling performance.
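
Below is a minimal, illustrative sketch of the coarse-to-fine pipeline the abstract describes: multi-scale sliding windows produce fixed-length temporal proposals, the proposal whose pooled visual feature best matches the sentence embedding is taken as the coarse result, and a frame-level pass refines its boundaries. It assumes per-frame visual features and a sentence embedding already living in a shared space; the window sizes, cosine matching, and the thresholding rule in the fine stage are assumptions made for illustration, not the authors' exact design.

```python
# Illustrative sketch of the coarse-to-fine grounding pipeline (assumed details,
# not the paper's exact implementation).
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def generate_proposals(num_frames, window_sizes=(16, 32, 64), stride_ratio=0.5):
    """Fixed-length temporal proposals from multi-scale sliding windows."""
    proposals = []
    for w in window_sizes:
        stride = max(1, int(w * stride_ratio))
        for start in range(0, max(1, num_frames - w + 1), stride):
            proposals.append((start, min(start + w, num_frames)))
    return proposals

def coarse_stage(frame_feats, sent_feat, proposals):
    """Pick the proposal whose mean-pooled visual feature best matches the sentence."""
    scores = [cosine(frame_feats[s:e].mean(axis=0), sent_feat) for s, e in proposals]
    return proposals[int(np.argmax(scores))]

def fine_stage(frame_feats, sent_feat, proposal, keep_ratio=0.8):
    """Refine the best proposal to the contiguous frame span around the peak frame
    whose per-frame similarity stays above keep_ratio of the peak (assumed heuristic)."""
    s, e = proposal
    sims = np.array([cosine(frame_feats[i], sent_feat) for i in range(s, e)])
    peak = int(np.argmax(sims))
    thresh = keep_ratio * sims[peak]
    left, right = peak, peak
    while left > 0 and sims[left - 1] >= thresh:
        left -= 1
    while right < len(sims) - 1 and sims[right + 1] >= thresh:
        right += 1
    return s + left, s + right + 1

# Toy example: 200 frames of 512-d features and a 512-d sentence embedding.
frame_feats = np.random.randn(200, 512)
sent_feat = np.random.randn(512)
coarse = coarse_stage(frame_feats, sent_feat, generate_proposals(len(frame_feats)))
fine = fine_stage(frame_feats, sent_feat, coarse)
print("coarse segment (frames):", coarse, "-> fine segment (frames):", fine)
```

In the paper the visual-sentence matching is learned without temporal annotations; the random features above are placeholders only to make the sketch runnable.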
