Paper Title

Video Object Grounding using Semantic Roles in Language Description

Paper Authors

Arka Sadhu, Kan Chen, Ram Nevatia

Paper Abstract

We explore the task of Video Object Grounding (VOG), which grounds objects in videos referred to in natural language descriptions. Previous methods apply image grounding based algorithms to address VOG, fail to explore the object relation information and suffer from limited generalization. Here, we investigate the role of object relations in VOG and propose a novel framework VOGNet to encode multi-modal object relations via self-attention with relative position encoding. To evaluate VOGNet, we propose novel contrasting sampling methods to generate more challenging grounding input samples, and construct a new dataset called ActivityNet-SRL (ASRL) based on existing caption and grounding datasets. Experiments on ASRL validate the need of encoding object relations in VOG, and our VOGNet outperforms competitive baselines by a significant margin.
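The abstract states that VOGNet encodes multi-modal object relations via self-attention with relative position encoding. The snippet below is a minimal, generic PyTorch sketch of single-head self-attention with a learned relative-position bias, included only to illustrate that mechanism; the class name `RelPosSelfAttention`, its parameters, and the bias scheme are illustrative assumptions and do not reproduce VOGNet's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelPosSelfAttention(nn.Module):
    """Single-head self-attention with a learned relative-position bias.

    A generic sketch (not VOGNet's code): each object/frame feature attends
    to the others, and a bias indexed by the relative offset between
    positions is added to the attention logits.
    """

    def __init__(self, dim, max_len=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # one learned bias per relative offset in [-(max_len-1), max_len-1]
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, dim), e.g. object-proposal features
        b, n, _ = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        logits = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (b, n, n)
        # relative offsets i - j, shifted to non-negative indices
        pos = torch.arange(n, device=x.device)
        rel = pos[:, None] - pos[None, :] + self.max_len - 1        # (n, n)
        logits = logits + self.rel_bias[rel]                        # add position bias
        attn = F.softmax(logits, dim=-1)
        return torch.matmul(attn, v)


# usage: 4 clips, 10 object proposals each, 256-d features
feats = torch.randn(4, 10, 256)
out = RelPosSelfAttention(256)(feats)
print(out.shape)  # torch.Size([4, 10, 256])
```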
