Paper Title
Large-scale Robustness Analysis of Video Action Recognition Models
Paper Authors
Paper Abstract
We have seen great progress in video action recognition in recent years. There are several models based on convolutional neural networks (CNNs), as well as some recent transformer-based approaches, that provide top performance on existing benchmarks. In this work, we perform a large-scale robustness analysis of these existing models for video action recognition. We focus on robustness against real-world distribution-shift perturbations rather than adversarial perturbations. We propose four different benchmark datasets, HMDB51-P, UCF101-P, Kinetics400-P, and SSv2-P, to perform this analysis. We study the robustness of six state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings: 1) transformer-based models are consistently more robust than CNN-based models, 2) pretraining improves robustness more for transformer-based models than for CNN-based models, and 3) all of the studied models are robust to temporal perturbations on all datasets but SSv2, suggesting that the importance of temporal information for action recognition varies across datasets and activities. Next, we study the role of augmentations in model robustness and present a real-world dataset, UCF101-DS, which contains realistic distribution shifts, to further validate some of these findings. We believe this study will serve as a benchmark for future research in robust video action recognition.
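To make the evaluation protocol concrete, the sketch below shows how a single distribution-shift perturbation and the resulting accuracy drop might be computed. This is a minimal illustration in PyTorch under assumed conventions: the function names (`perturb_gaussian_noise`, `robustness_gap`), the severity-to-sigma mapping, and the video tensor layout are all hypothetical and are not taken from the paper's benchmark code.

```python
import torch

def perturb_gaussian_noise(videos, severity=3):
    """Apply Gaussian noise to a batch of video clips with values in [0, 1].

    `severity` (1-5) selects an increasing noise standard deviation,
    mirroring the severity scheme of image-corruption benchmarks.
    These sigma values are illustrative, not the paper's.
    """
    sigma = (0.04, 0.08, 0.12, 0.16, 0.20)[severity - 1]
    return (videos + sigma * torch.randn_like(videos)).clamp(0.0, 1.0)

@torch.no_grad()
def accuracy(model, videos, labels):
    """Top-1 accuracy of a classification model on a batch of clips."""
    preds = model(videos).argmax(dim=-1)
    return (preds == labels).float().mean().item()

@torch.no_grad()
def robustness_gap(model, videos, labels, perturb, severity=3):
    """Accuracy drop under a perturbation: clean minus perturbed accuracy."""
    clean = accuracy(model, videos, labels)
    shifted = accuracy(model, perturb(videos, severity), labels)
    return clean - shifted
```

A full benchmark run along these lines would sweep such a gap over every perturbation type and severity level for each model, which is one way per-model robustness comparisons like those in the findings above could be aggregated.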