Paper Title
Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos
Paper Authors
Abstract
Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task that facilitates cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which has access only to coarse video-level language descriptions without temporal boundary annotations. This setting is closer to reality, since such weak labels are more readily available in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that uses reinforcement learning (RL) to guide the progressive refinement of the temporal boundary. To the best of our knowledge, this is the first attempt to extend RL to the temporal localization task with weak supervision. Because it is non-trivial to obtain a straightforward reward function in the absence of pairwise granular boundary-query annotations, we craft a cross-modal alignment evaluator that measures the degree of alignment of each segment-query pair and provides tailor-designed rewards. This refinement scheme completely abandons the traditional sliding-window-based solution pattern and yields more efficient, boundary-flexible and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly supervised method and even beats some competitive fully supervised ones.
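The refinement loop described in the abstract can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the action set (shift, expand, shrink), the step size, the use of mean-pooled cosine similarity as a stand-in for the cross-modal alignment evaluator, the reward defined as the gain in that score, and the greedy choice of action in place of a learned RL policy.

```python
import math

# Hypothetical action set; the paper's actual actions may differ.
ACTIONS = ("shift_left", "shift_right", "expand", "shrink")


def apply_action(start, end, action, num_clips, step=0.2):
    """Move the half-open boundary [start, end) by a fraction of its length."""
    delta = max(1, int(round(step * (end - start))))
    if action == "shift_left":
        start, end = start - delta, end - delta
    elif action == "shift_right":
        start, end = start + delta, end + delta
    elif action == "expand":
        start, end = start - delta, end + delta
    elif action == "shrink":
        start, end = start + delta, end - delta
    # Clamp to the video and keep at least one clip in the segment.
    start = max(0, min(start, num_clips - 1))
    end = max(start + 1, min(end, num_clips))
    return start, end


def alignment_score(clip_feats, query_vec, start, end):
    """Stand-in evaluator: cosine similarity of mean-pooled clips and the query."""
    seg = clip_feats[start:end]
    mean = [sum(c[d] for c in seg) / len(seg) for d in range(len(query_vec))]
    dot = sum(m * q for m, q in zip(mean, query_vec))
    norm = math.sqrt(sum(m * m for m in mean)) * math.sqrt(sum(q * q for q in query_vec))
    return dot / norm if norm > 0 else 0.0


def refine(clip_feats, query_vec, start, end, max_steps=10):
    """Greedy stand-in for a learned policy: take the action with the best reward."""
    score = alignment_score(clip_feats, query_vec, start, end)
    for _ in range(max_steps):
        candidates = []
        for action in ACTIONS:
            s, e = apply_action(start, end, action, len(clip_feats))
            new_score = alignment_score(clip_feats, query_vec, s, e)
            # Reward = improvement in alignment; no boundary labels are used.
            candidates.append((new_score - score, s, e, new_score))
        reward, s, e, new_score = max(candidates)
        if reward <= 0:  # no action improves alignment: stop refining
            break
        start, end, score = s, e, new_score
    return start, end, score


# Toy video: 10 clips with 2-d features; clips 4..6 match the query direction.
clips = [[1.0, 0.0] if 4 <= i <= 6 else [0.0, 1.0] for i in range(10)]
query = [1.0, 0.0]
start, end, score = refine(clips, query, 0, len(clips))
print(start, end, score)
```

Starting from the whole video, the loop narrows the boundary toward the clips that align with the query without ever seeing a ground-truth boundary, which is the weakly supervised idea the abstract describes. In an actual RL formulation the greedy search would be replaced by a trained policy, and the evaluator would itself be learned from video-level descriptions.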