Title
Static and Dynamic Concepts for Self-supervised Video Representation Learning
Authors
Abstract
In this paper, we propose a novel learning scheme for self-supervised video representation learning. Motivated by how humans understand videos, we propose to first learn general visual concepts and then attend to discriminative local areas for video understanding. Specifically, we use static frames and frame differences to help decouple static and dynamic concepts, and align the static and dynamic concept distributions separately in latent space. We add diversity and fidelity regularizations to ensure that we learn a compact set of meaningful concepts. We then employ a cross-attention mechanism to aggregate detailed local features of different concepts, and filter out redundant concepts with low activations before performing local concept contrast. Extensive experiments demonstrate that our method distills meaningful static and dynamic concepts to guide video understanding, and obtains state-of-the-art results on UCF-101, HMDB-51, and Diving-48.
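To make the pipeline in the abstract concrete, here is a minimal pure-Python sketch of two of its steps: computing frame differences as a motion cue, and using cross-attention to aggregate local features per concept while dropping low-activation concepts. The abstract does not give implementation details, so everything specific here is an assumption for illustration: the function names, the use of the maximum scaled dot-product score as the "activation", and the `keep_ratio` parameter are all hypothetical, and this is not the authors' implementation.

```python
import math

def frame_difference(frames):
    """Difference between consecutive frames (each frame is a flat list of pixel values)."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(frames, frames[1:])]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def aggregate_concepts(concepts, local_feats, keep_ratio=0.5):
    """Cross-attention with concepts as queries and local features as keys/values.

    Returns (kept_indices, pooled_features) for the most activated concepts.
    Assumption: a concept's activation is its largest scaled dot-product score.
    """
    scale = math.sqrt(len(concepts[0]))
    dim = len(local_feats[0])
    pooled, activations = [], []
    for q in concepts:
        scores = [dot(q, f) / scale for f in local_feats]
        weights = softmax(scores)
        # Attention-weighted sum of the local features for this concept.
        pooled.append([sum(w * f[d] for w, f in zip(weights, local_feats))
                       for d in range(dim)])
        activations.append(max(scores))
    # Keep only the top fraction of concepts by activation (hypothetical rule).
    n_keep = max(1, int(len(concepts) * keep_ratio))
    ranked = sorted(range(len(concepts)), key=lambda i: activations[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [pooled[i] for i in kept]

# Toy data: three 4-pixel frames, three 2-d concepts, two 2-d local features.
frames = [[0, 0, 1, 1], [0, 1, 1, 0], [0, 1, 1, 0]]
diffs = frame_difference(frames)

concepts = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
local_feats = [[2.0, 0.0], [0.0, 0.5]]
kept, feats = aggregate_concepts(concepts, local_feats)
```

In this toy run the second frame difference is all zeros because the last two frames are identical, and only the first concept survives filtering since it matches a local feature most strongly. The surviving pooled features would then feed a contrastive objective, which is omitted here.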