Paper Title
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
Paper Authors
Paper Abstract
Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.
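The abstract describes an object-aware knowledge distillation mechanism in which local object information regularizes global scene features, but it does not specify the form of the objective. Below is a minimal sketch of one plausible reading: mean-pooling the variable-count object features into a fixed-size summary and penalizing its distance to the scene feature. The function name, the L2 form of the penalty, and the `alpha` weight are all hypothetical illustrations, not the paper's actual formulation.

```python
import numpy as np

def object_aware_distillation_loss(scene_feat, object_feats, alpha=0.5):
    """Hypothetical sketch of an object-aware distillation term.

    scene_feat:   (d,) global scene feature vector.
    object_feats: (n, d) local features for a variable number n of objects.
    alpha:        assumed regularization weight.
    """
    # Pool the variable number of object features into one fixed-size summary,
    # so the penalty is independent of how many objects were detected.
    pooled = object_feats.mean(axis=0)
    # L2 penalty pulling the global scene feature toward the object summary.
    return alpha * float(np.sum((scene_feat - pooled) ** 2))
```

With this formulation, the loss vanishes when the scene feature already matches the pooled object summary and grows quadratically with their disagreement, which matches the stated goal of using local object information to regularize the global representation.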