Paper Title
GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning
Paper Authors
Paper Abstract
Clustering is a ubiquitous tool in unsupervised learning. Most of the existing self-supervised representation learning methods typically cluster samples based on visually dominant features. While this works well for image-based self-supervision, it often fails for videos, which require understanding motion rather than focusing on the background. Using optical flow as complementary information to RGB can alleviate this problem. However, we observe that a naive combination of the two views does not provide meaningful gains. In this paper, we propose a principled way to combine the two views. Specifically, we propose a novel clustering strategy where we use the initial cluster assignment of each view as a prior to guide the final cluster assignment of the other view. This idea enforces similar cluster structures for both views, and the formed clusters are semantically abstract and robust to noisy inputs coming from each individual view. Additionally, we propose a novel regularization strategy to address the feature collapse problem, which is common in cluster-based self-supervised learning methods. Our extensive evaluation shows the effectiveness of our learned representations on downstream tasks, e.g., video retrieval and action recognition. Specifically, we outperform the state of the art by 7% on UCF and 4% on HMDB for video retrieval, and 5% on UCF and 6% on HMDB for video classification.
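To make the guided-assignment idea concrete, the following is a minimal NumPy sketch of one possible instantiation: initial per-view assignments are computed with Sinkhorn-Knopp normalization (as in SwAV-style online clustering), and each view's final assignment is recomputed with the other view's initial assignment folded in as a log-prior on the similarity scores. The function names, the additive log-prior, and all hyperparameters are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Sinkhorn-Knopp: turn a (batch x clusters) score matrix into a soft,
    approximately balanced assignment (rows sum to 1, columns near-uniform)."""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    B, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # balance cluster marginals
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)  # balance sample marginals
        Q /= B
    return Q * B  # each row is a distribution over clusters

def guided_assignment(scores_rgb, scores_flow, n_iters=3, eps=0.05):
    """Compute an initial assignment per view, then use it as a prior that
    biases the final assignment of the *other* view (cross-view guidance).
    The additive log-prior below is one assumed way to inject that prior."""
    q_rgb_init = sinkhorn(scores_rgb, n_iters, eps)
    q_flow_init = sinkhorn(scores_flow, n_iters, eps)
    q_rgb_final = sinkhorn(scores_rgb + eps * np.log(q_flow_init + 1e-8), n_iters, eps)
    q_flow_final = sinkhorn(scores_flow + eps * np.log(q_rgb_init + 1e-8), n_iters, eps)
    return q_rgb_final, q_flow_final

# Toy usage: 8 clips, 4 prototypes, random feature-prototype similarities per view.
rng = np.random.default_rng(0)
z_rgb, z_flow = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
q_rgb, q_flow = guided_assignment(z_rgb, z_flow)
print(q_rgb.sum(axis=1))  # rows sum to ~1.0: soft cluster assignments
```

In this reading, the guided assignments of one view would serve as training targets for the other view's predicted cluster probabilities, which is what couples the RGB and optical-flow cluster structures; the balanced marginals from Sinkhorn-Knopp are also the standard device for discouraging all samples from collapsing into a single cluster.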