Paper Title

Text Synopsis Generation for Egocentric Videos

Paper Authors

Sharghi, Aidean, Lobo, Niels da Vitoria, Shah, Mubarak

Paper Abstract

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates all the descriptions together to select the ones containing crucial information. Out of the thousands of descriptions generated for a video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 and 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, containing only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics from natural language processing.
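
Read literally, the abstract describes a three-stage pipeline: a per-shot captioner generates a description for each shot, a visual-language content matching unit scores whether each description actually matches its shot, and the purport network evaluates the descriptions jointly to keep only the informative ones (about 5 percent of them). The sketch below is a minimal, hypothetical illustration of that data flow in PyTorch; the module names (ShotCaptioner, ContentMatcher, PurportNetwork), dimensions, greedy decoder, and score combination are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the pipeline in the abstract:
# per-shot captioning -> content matching -> purport-based selection.
import torch
import torch.nn as nn


class ShotCaptioner(nn.Module):
    """Generates a token sequence (description) for each video shot."""

    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, shot_feat, max_len=20):
        # Greedy decoding; shot_feat: (num_shots, feat_dim)
        h = torch.tanh(self.encode(shot_feat))
        tok = torch.zeros(shot_feat.size(0), dtype=torch.long)  # assume <BOS> id 0
        tokens = []
        for _ in range(max_len):
            h = self.rnn(self.embed(tok), h)
            tok = self.out(h).argmax(dim=-1)
            tokens.append(tok)
        return torch.stack(tokens, dim=1), h  # token ids and final caption state


class ContentMatcher(nn.Module):
    """Scores how well each generated description matches its shot (weakly supervised)."""

    def __init__(self, hidden=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, shot_emb, caption_state):
        return torch.sigmoid(self.score(torch.cat([shot_emb, caption_state], dim=-1)))


class PurportNetwork(nn.Module):
    """Looks at all descriptions of a video together and scores their informativeness."""

    def __init__(self, hidden=512):
        super().__init__()
        self.context = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, caption_states):
        # caption_states: (1, num_shots, hidden) -- all shots of one long video
        ctx, _ = self.context(caption_states)
        return self.score(ctx).squeeze(-1)  # (1, num_shots)


# Toy forward pass over 100 shots of one video; keep the top 5% of descriptions.
captioner, matcher, purport = ShotCaptioner(), ContentMatcher(), PurportNetwork()
shots = torch.randn(100, 2048)                             # placeholder shot features
shot_emb = torch.tanh(captioner.encode(shots))
tokens, cap_state = captioner(shots)
match = matcher(shot_emb, cap_state).squeeze(-1)           # correctness scores
importance = purport(cap_state.unsqueeze(0)).squeeze(0)    # informativeness scores
keep = (match * importance).topk(k=5).indices              # shots whose sentences form the synopsis
```

The multiplication of the two scores at the end is one simple way to require a sentence to be both correct and important before it enters the synopsis; the paper's actual fusion and training objectives are described in the full text.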
