Temporal Pyramid Network for Action Recognition

Paper title

Temporal Pyramid Network for Action Recognition

Authors

Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, Bolei Zhou

Abstract

Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling such visual tempos of different actions facilitates their recognition. Previous works often capture the visual tempo by sampling raw videos at multiple rates and constructing an input-level frame pyramid, which usually requires a costly multi-branch network to handle. In this work, we propose a generic Temporal Pyramid Network (TPN) at the feature level, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. Two essential components of TPN, the source of features and the fusion of features, form a feature hierarchy for the backbone so that it can capture action instances at various tempos. TPN also shows consistent improvements over other challenging baselines on several action recognition datasets. Specifically, when equipped with TPN, the 3D ResNet-50 with dense sampling obtains a 2% gain on the validation set of Kinetics-400. A further analysis also reveals that TPN gains most of its improvements on action classes that have large variances in their visual tempos, validating the effectiveness of TPN.
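The contrast between an input-level frame pyramid and a feature-level pyramid can be illustrated with a toy sketch. The code below is not the TPN architecture; it only assumes that a backbone has already produced one feature vector per time step, and it pools that single feature sequence at several temporal rates before aggregating the levels. The function names (`temporal_avg_pool`, `feature_pyramid`, `fuse`), the pooling rates, and the use of average and max pooling are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Feature = List[float]  # one C-dimensional feature vector per time step


def temporal_avg_pool(seq: List[Feature], rate: int) -> List[Feature]:
    """Average every `rate` consecutive time steps (a trailing partial group is averaged too)."""
    pooled = []
    for start in range(0, len(seq), rate):
        group = seq[start:start + rate]
        channels = len(group[0])
        pooled.append([sum(f[c] for f in group) / len(group) for c in range(channels)])
    return pooled


def feature_pyramid(seq: List[Feature], rates: Sequence[int]) -> List[List[Feature]]:
    """Build one pyramid level per rate from the SAME feature sequence.

    The backbone runs once; only the cheap pooling is repeated per level.
    """
    return [temporal_avg_pool(seq, r) for r in rates]


def fuse(levels: List[List[Feature]]) -> Feature:
    """Toy aggregation (an assumption): max over time per channel, concatenated across levels."""
    fused: Feature = []
    for level in levels:
        channels = len(level[0])
        fused.extend(max(f[c] for f in level) for c in range(channels))
    return fused


# Stand-in for backbone output: 8 time steps, 2 channels.
features = [[float(t), float(2 * t)] for t in range(8)]
levels = feature_pyramid(features, rates=(1, 2, 4))
descriptor = fuse(levels)
```

Here `levels` holds sequences of length 8, 4, and 2, each summarizing the same clip at a coarser tempo, and `descriptor` concatenates one vector per level. An input-level pyramid would instead resample the raw frames at each rate and run a separate backbone branch on every resampled clip, which is the cost the abstract refers to.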
