Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency

Paper title

Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency

Authors

Qing, Zhiwu; Zhang, Shiwei; Huang, Ziyuan; Xu, Yi; Wang, Xiang; Tang, Mingqian; Gao, Changxin; Jin, Rong; Sang, Nong

Abstract

Natural videos provide rich visual content for self-supervised learning. Yet most existing approaches for learning spatio-temporal representations rely on manually trimmed videos, which limits both the diversity of visual patterns and the performance gain. In this work, we aim to learn representations by leveraging the more abundant information in untrimmed videos. To this end, we propose to learn a hierarchy of consistencies in videos, i.e., visual consistency and topical consistency, corresponding respectively to clip pairs that tend to be visually similar when separated by a short time span and to share similar topics when separated by a long time span. Specifically, a hierarchical consistency learning framework, HiCo, is presented, in which visually consistent pairs are encouraged to have the same representation through contrastive learning, while topically consistent pairs are coupled through a topical classifier that distinguishes whether they are topic-related. Further, we impose a gradual sampling algorithm for the proposed hierarchical consistency learning and demonstrate its theoretical superiority. Empirically, we show that HiCo not only generates stronger representations on untrimmed videos but also improves representation quality when applied to trimmed videos. This is in contrast to standard contrastive learning, which fails to learn appropriate representations from untrimmed videos.
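
The two kinds of consistency described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_visual_gap` parameter, the InfoNCE form of the contrastive loss, and the rule that clips from the same video count as topic-related are all assumptions made for illustration, and the gradual sampling algorithm is not modeled because the abstract gives no details about it.

```python
import math
import random


def sample_clip_pairs(num_frames, clip_len, max_visual_gap, rng):
    """Sample clip start indices from one untrimmed video (hypothetical helper).

    Returns a short-span pair (assumed visually consistent) and a
    long-span pair (assumed topically consistent) sharing one anchor.
    """
    last_start = num_frames - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than one clip")
    anchor = rng.randint(0, last_start)
    # Visual partner: restricted to a short window around the anchor.
    lo = max(0, anchor - max_visual_gap)
    hi = min(last_start, anchor + max_visual_gap)
    visual_partner = rng.randint(lo, hi)
    # Topical partner: anywhere in the same video, so the gap may be long.
    topical_partner = rng.randint(0, last_start)
    return (anchor, visual_partner), (anchor, topical_partner)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding, using cosine similarity.

    Assumed stand-in for the contrastive objective on short-span pairs.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def topical_bce(logit, same_video):
    """Binary cross-entropy on a pair logit from a topical classifier.

    Assumption: the label is 1 when both clips come from the same video.
    """
    y = 1.0 if same_video else 0.0
    # Numerically stable form of BCE with logits.
    return max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))
```

In this sketch the short-span pair would feed `info_nce`, while the long-span pair would be scored by a classifier whose output logit is passed to `topical_bce`, with clips from other videos serving as the negative (topic-unrelated) examples.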
