Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition

Paper title

Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition

Authors

M. Esat Kalfaoglu, Sinan Kalkan, A. Aydin Alatan

Abstract

In this work, we combine 3D convolution with late temporal modeling for action recognition. To this end, we replace the conventional Temporal Global Average Pooling (TGAP) layer at the end of a 3D convolutional architecture with a Bidirectional Encoder Representations from Transformers (BERT) layer, in order to better utilize the temporal information through BERT's attention mechanism. We show that this replacement improves the performance of many popular 3D convolutional architectures for action recognition, including ResNeXt, I3D, SlowFast and R(2+1)D. Moreover, we report state-of-the-art results on the HMDB51 and UCF101 datasets, with 85.10% and 98.69% top-1 accuracy, respectively. The code is publicly available.
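
The contrast between TGAP and attention-based pooling can be illustrated with a small sketch. This is not the authors' code: it is a minimal, standard-library Python illustration under stated assumptions. The classification token, the positional embeddings, the identity query/key/value projections, and all names (`tgap`, `attention_pool`, `cls_token`, `pos_emb`) are hypothetical choices made for this example and are not taken from the abstract. It treats the 3D CNN output as T feature vectors and shows that averaging ignores frame order, while attention with positional embeddings can weight frames differently depending on where they occur.

```python
import math
import random


def tgap(features):
    """Temporal global average pooling: mean over the time axis.

    features: list of T vectors, each a list of D floats.
    """
    t = len(features)
    d = len(features[0])
    return [sum(f[j] for f in features) / t for j in range(d)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, cls_token, pos_emb):
    """Single-head self-attention read out at a classification token.

    Simplifying assumptions: identity query/key/value projections, one head,
    one layer, no feed-forward block and no layer normalization.
    pos_emb has T + 1 rows: row 0 for the token, rows 1..T for the frames.
    """
    tokens = [cls_token] + features
    # Add a position-specific vector to every token so order matters.
    x = [[v + p for v, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_emb)]
    d = len(cls_token)
    query = x[0]
    # Scaled dot-product scores between the token and every position.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
    weights = softmax(scores)
    pooled = [sum(w * row[j] for w, row in zip(weights, x)) for j in range(d)]
    return pooled, weights


# Random stand-ins for the T x D features a 3D CNN backbone would produce.
rng = random.Random(0)
T, D = 4, 8
features = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
cls_token = [rng.gauss(0, 1) for _ in range(D)]
pos_emb = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T + 1)]
reversed_features = features[::-1]

avg_fwd, avg_rev = tgap(features), tgap(reversed_features)
att_fwd, w_fwd = attention_pool(features, cls_token, pos_emb)
att_rev, w_rev = attention_pool(reversed_features, cls_token, pos_emb)

print("TGAP max difference after reversing frames:",
      max(abs(a - b) for a, b in zip(avg_fwd, avg_rev)))
print("Attention max difference after reversing frames:",
      max(abs(a - b) for a, b in zip(att_fwd, att_rev)))
```

In a full model the pooled vector would feed a classifier, and the projections, token, and positional embeddings would be learned rather than sampled at random.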
