Paper Title
Temporally Coherent Embeddings for Self-Supervised Video Representation Learning
Paper Authors
Paper Abstract
This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning. The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than indirectly learning it through ranking or predictive proxy tasks. Just as high-level visual information in the world changes smoothly, we believe that nearby frames in learned representations should exhibit similarly smooth properties. Using this assumption, we train our TCE model to encode videos such that adjacent frames exist close to each other and videos are separated from one another. Using TCE, we learn robust representations from large quantities of unlabeled video data. We thoroughly analyse and evaluate our self-supervised learned TCE models on a downstream task of video action recognition using multiple challenging benchmarks (Kinetics400, UCF101, HMDB51). With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN pre-trained models on UCF101. The code and pre-trained models for this paper can be downloaded at: https://github.com/csiro-robotics/TCE
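To make the training objective described in the abstract concrete, below is a minimal sketch in PyTorch of one way such a temporal-coherence contrastive loss could be written: adjacent frames of the same video act as positive pairs, while frames from other videos in the batch act as negatives. The function name `temporal_coherence_loss`, the `(B, T, D)` tensor layout, and the InfoNCE-style formulation are illustrative assumptions for this sketch, not the paper's exact implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def temporal_coherence_loss(embeddings: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical InfoNCE-style temporal-coherence objective.

    embeddings: (B, T, D) frame embeddings, where B is the number of videos
    in the batch and T the number of sampled frames per video. Adjacent
    frames of the same video are treated as positive pairs; frames from
    other videos serve as negatives.
    """
    B, T, D = embeddings.shape
    z = F.normalize(embeddings, dim=-1)            # work in cosine-similarity space

    # Positive pairs: each frame paired with its temporal neighbour.
    anchors = z[:, :-1, :].reshape(-1, D)          # (B*(T-1), D)
    positives = z[:, 1:, :].reshape(-1, D)         # (B*(T-1), D)
    logits_pos = (anchors * positives).sum(-1, keepdim=True) / temperature

    # Negatives: every frame in the batch, with same-video comparisons masked
    # out so only frames from *other* videos push the anchor away.
    all_frames = z.reshape(-1, D)                   # (B*T, D)
    sim = anchors @ all_frames.t() / temperature    # (B*(T-1), B*T)
    video_of_frame = torch.arange(B, device=z.device).repeat_interleave(T)
    video_of_anchor = torch.arange(B, device=z.device).repeat_interleave(T - 1)
    same_video = video_of_anchor[:, None].eq(video_of_frame[None, :])
    sim = sim.masked_fill(same_video, float('-inf'))

    # Standard InfoNCE: positive logit in column 0, negatives after it.
    logits = torch.cat([logits_pos, sim], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, labels)
```

In a training loop this would be applied to the per-frame outputs of a 2D-CNN backbone run frame-wise over each sampled clip, pulling temporally adjacent embeddings together while pushing apart embeddings from different videos.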