Paper Title

SVFormer: Semi-supervised Video Transformer for Action Recognition

Authors

Zhen Xing, Qi Dai, Han Hu, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

Abstract

Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotation. Existing approaches mainly use convolutional neural networks, while vision transformer models remain less explored. In this paper, we investigate the use of transformer models for action recognition under the semi-supervised learning (SSL) setting. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown to be effective for semi-supervised image classification, they generally produce limited results for video recognition. We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored for video data, in which video clips are mixed via a mask whose masked tokens are consistent over the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets, Kinetics-400, UCF-101, and HMDB-51, verify the advantage of SVFormer. In particular, SVFormer outperforms the state of the art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. Our method can hopefully serve as a strong benchmark and encourage future research on semi-supervised action recognition with Transformer networks.
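To make two ideas from the abstract concrete, here is a minimal NumPy sketch of a tube-style token mix (the same spatial mask repeated over every frame) and an exponential-moving-average teacher update. This is an illustration under stated assumptions, not the authors' implementation: the function names, the token layout (frames, patches, dim), the mask ratio, and the momentum value are all hypothetical choices made for this example.

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio, rng):
    """Boolean mask of shape (num_frames, num_patches).

    Assumption: one random spatial pattern is drawn and reused for every
    frame, so the masked positions form "tubes" along the temporal axis.
    """
    num_masked = int(round(mask_ratio * num_patches))
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return np.broadcast_to(spatial, (num_frames, num_patches)).copy()

def tube_token_mix(tokens_a, tokens_b, mask):
    """Mix two clips token-wise: clip B where mask is True, clip A elsewhere.

    tokens_a, tokens_b: arrays of shape (num_frames, num_patches, dim).
    """
    return np.where(mask[..., None], tokens_b, tokens_a)

def ema_update(teacher, student, momentum=0.99):
    """Per-parameter update: teacher <- m * teacher + (1 - m) * student.

    teacher, student: dicts mapping parameter names to arrays (hypothetical
    stand-ins for model weights).
    """
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}

# Toy usage: clip A is all zeros, clip B is all ones.
rng = np.random.default_rng(0)
clip_a = np.zeros((8, 16, 4))
clip_b = np.ones((8, 16, 4))
mask = tube_mask(num_frames=8, num_patches=16, mask_ratio=0.25, rng=rng)
mixed = tube_token_mix(clip_a, clip_b, mask)
frac_from_b = mask.mean()  # fraction of tokens taken from clip B
```

Because the mask is identical in every frame, a patch position comes from the same source clip throughout the mixed clip, which is the temporal consistency the abstract describes. How the mixed clip is supervised and how temporal warping is applied are not specified in the abstract and are left out of this sketch.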
