Paper Title

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

Paper Authors

Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

Paper Abstract

YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In comparison to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described by the demonstrator verbally (Cross-Modal Saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps allowing us to obtain the ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark.
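
The pseudo-summary generation described in the abstract can be pictured as a simple segment-scoring heuristic that combines the two cues. The sketch below is a minimal illustration, not the paper's implementation: it assumes precomputed segment embeddings in a shared video-text space, and the mixing weight `alpha`, the top-`k` selection, and all function names are hypothetical choices for exposition.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pseudo_summary(video_segs, speech_segs, other_videos, k=3, alpha=0.5):
    """Score each segment of one video and pick the top-k as a pseudo summary.

    video_segs:   (n, d) visual embeddings, one row per segment.
    speech_segs:  (n, d) embeddings of the transcribed speech aligned to
                  each segment, in the same shared space.
    other_videos: list of (m_i, d) arrays of segment embeddings from other
                  videos of the same task.
    alpha:        illustrative weight mixing the two cues (an assumption,
                  not a value from the paper).
    """
    scores = []
    for v, s in zip(video_segs, speech_segs):
        # (i) Task relevance: a step that recurs across videos of the same
        # task scores high -- here, the mean over other videos of the
        # best-matching segment similarity.
        relevance = np.mean([max(cosine(v, u) for u in other)
                             for other in other_videos])
        # (ii) Cross-modal saliency: steps the demonstrator narrates tend
        # to align visually with the speech -- here, plain cosine
        # similarity in the shared video-text space.
        saliency = cosine(v, s)
        scores.append(alpha * relevance + (1 - alpha) * saliency)
    # Return the top-k segment indices in temporal order.
    return sorted(np.argsort(scores)[-k:].tolist())

# Toy usage with random embeddings standing in for real features.
rng = np.random.default_rng(0)
d, n = 16, 8
video = rng.normal(size=(n, d))
speech = video + 0.1 * rng.normal(size=(n, d))  # speech roughly tracks visuals
others = [rng.normal(size=(6, d)) for _ in range(3)]
print(pseudo_summary(video, speech, others, k=3))
```

In the paper these pseudo summaries are only a training signal: the summarization network (context-aware temporal video encoder plus segment scoring transformer) is trained on them as weak supervision, so at test time it needs only the video and its transcribed speech.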
