Paper Title
It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training
Paper Authors
Paper Abstract
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline. These methods have demonstrated outstanding effectiveness on downstream video tasks and superior data efficiency on small datasets. However, they do not fully exploit temporal relations. In this work, we explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling (MAM2) framework. Specifically, we design an encoder-regressor-decoder pipeline for this task. The regressor separates feature encoding from pretext-task completion, such that the feature extraction process is completed adequately by the encoder. To guide the encoder to fully excavate spatial-temporal features, two separate decoders are used for the two pretext tasks of disentangled appearance and motion prediction. We explore various motion prediction targets and find that RGB difference is simple yet effective. For appearance prediction, VQGAN codes are leveraged as the prediction target. With our pre-training pipeline, convergence can be remarkably sped up; for example, we require only half the epochs of the state-of-the-art VideoMAE (400 vs. 800) to achieve competitive performance. Extensive experimental results show that our method learns generalized video representations. Notably, our MAM2 with ViT-B achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
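The following is a minimal PyTorch-style sketch of the encoder-regressor-decoder idea and the RGB-difference motion target described in the abstract. All module names, shapes, and the helper function are illustrative assumptions for exposition, not the authors' released MAM2 implementation.

```python
# Hypothetical sketch of the MAM2 pipeline from the abstract:
# an encoder over visible tokens, a regressor that fills in masked positions,
# and two separate decoders for appearance (VQGAN codes) and motion (RGB difference).
import torch
import torch.nn as nn


class MAM2Sketch(nn.Module):
    def __init__(self, encoder, regressor, app_decoder, motion_decoder):
        super().__init__()
        self.encoder = encoder                 # feature extraction on visible tokens only
        self.regressor = regressor             # completes features at masked positions
        self.app_decoder = app_decoder         # pretext task 1: predict appearance targets
        self.motion_decoder = motion_decoder   # pretext task 2: predict motion targets

    def forward(self, visible_tokens, mask_queries):
        latent = self.encoder(visible_tokens)              # encoder handles representation learning
        predicted = self.regressor(latent, mask_queries)   # regressor handles pretext-task completion
        app_logits = self.app_decoder(predicted)           # e.g. logits over VQGAN code indices
        motion_pred = self.motion_decoder(predicted)       # e.g. regressed RGB-difference patches
        return app_logits, motion_pred


def rgb_difference_target(video: torch.Tensor) -> torch.Tensor:
    """Frame-wise RGB difference as a simple motion prediction target.

    video: (B, C, T, H, W) tensor of frames; returns a (B, C, T-1, H, W) tensor
    of differences between consecutive frames.
    """
    return video[:, :, 1:] - video[:, :, :-1]
```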