Paper Title
A Real-time Action Representation with Temporal Encoding and Deep Compression
Paper Authors
Paper Abstract
Deep neural networks have achieved remarkable success in video-based action recognition. However, most existing approaches cannot be deployed in practice due to their high computational cost. To address this challenge, we propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video action representations in a hierarchical multi-granularity manner while achieving high processing speed. Specifically, we propose a residual 3D Convolutional Neural Network (CNN) to capture complementary information on the appearance of individual frames and the motion between consecutive frames. Based on this CNN, we develop a new temporal encoding method to explore the temporal dynamics of the whole video. Furthermore, we integrate deep compression techniques with T-C3D to further accelerate deployment by reducing the model size. By these means, heavy computation is avoided at inference time, enabling the method to process videos beyond real-time speed while maintaining promising performance. On the UCF101 action recognition benchmark, our method improves on state-of-the-art real-time methods by 5.4% in accuracy and runs 2x faster at inference, with a model requiring less than 5 MB of storage. We validate our approach by studying its action representation performance on four different benchmarks over three different tasks. Extensive experiments demonstrate recognition performance comparable to state-of-the-art methods. The source code and the pre-trained models are publicly available at https://github.com/tc3d.
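The abstract only names the temporal encoding step, so the sketch below illustrates the general segment-based idea it describes: split the video into segments, encode one clip per segment with a shared residual 3D CNN, and fuse the clip-level scores into a video-level prediction. Everything here is an assumption for illustration, not the authors' implementation: the class name TemporalEncodingNet, the use of torchvision's r3d_18 as a stand-in backbone, the segment count, and average fusion.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # stand-in residual 3D CNN


class TemporalEncodingNet(nn.Module):
    """Minimal sketch of segment-based temporal encoding (hypothetical).

    The video is split into K segments, one short clip is sampled per
    segment, each clip is encoded by a shared residual 3D CNN, and the
    clip-level scores are aggregated into one video-level prediction.
    """

    def __init__(self, num_classes: int = 101, num_segments: int = 3):
        super().__init__()
        self.num_segments = num_segments
        # One backbone shared across all clips, so the parameter count
        # is independent of the number of segments.
        self.backbone = r3d_18(num_classes=num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_segments, channels, frames, height, width)
        b, k, c, t, h, w = clips.shape
        clip_scores = self.backbone(clips.view(b * k, c, t, h, w))
        clip_scores = clip_scores.view(b, k, -1)
        # Averaging is one simple aggregation function; max or weighted
        # fusion are drop-in alternatives.
        return clip_scores.mean(dim=1)


# Usage: a batch of 2 videos, 3 segments of 16 RGB frames at 112x112.
model = TemporalEncodingNet(num_classes=101, num_segments=3)
video_batch = torch.randn(2, 3, 3, 16, 112, 112)
logits = model(video_batch)  # shape: (2, 101)
```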
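Deep compression is likewise only named in the abstract. As an illustration of the general recipe (not the paper's exact pipeline), the sketch below applies two standard steps with PyTorch's built-in utilities: L1 magnitude pruning followed by 8-bit dynamic quantization. The 30% pruning ratio and the output file name are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models.video import r3d_18

# Stands in for a trained T-C3D backbone (hypothetical configuration).
model = r3d_18(num_classes=101)

# Step 1: prune the 30% smallest-magnitude weights in each conv/linear
# layer, then make the sparsity permanent. The 30% ratio is illustrative.
for module in model.modules():
    if isinstance(module, (nn.Conv3d, nn.Linear)):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Step 2: dynamically quantize the linear layers to 8-bit integers,
# shrinking storage and speeding up inference on CPU.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "tc3d_compressed.pth")
```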