Paper Title

Towards Improved Human Action Recognition Using Convolutional Neural Networks and Multimodal Fusion of Depth and Inertial Sensor Data

Paper Authors

Zeeshan Ahmad, Naimul Khan

Paper Abstract

This paper attempts to improve the accuracy of Human Action Recognition (HAR) by fusing depth and inertial sensor data. First, we transform the depth data into Sequential Front view Images (SFI) and fine-tune the pre-trained AlexNet on these images. Then, inertial data is converted into Signal Images (SI) and another convolutional neural network (CNN) is trained on these images. Finally, learned features are extracted from both CNNs and fused together to make a shared feature layer, and these features are fed to a classifier. We experiment with two classifiers, namely Support Vector Machines (SVM) and a softmax classifier, and compare their performance. The recognition accuracies of each modality, depth data alone and inertial data alone, are also calculated and compared with the fusion-based accuracies to highlight the fact that fusion of modalities yields better results than either individual modality. Experimental results on the UTD-MHAD and Kinect 2D datasets show that the proposed method achieves state-of-the-art results when compared to other recently proposed visual-inertial action recognition methods.
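
To make the described pipeline concrete, below is a minimal PyTorch sketch of the two-branch fusion scheme. It is not the authors' implementation: the signal-image CNN architecture, the 512-dimensional inertial feature size, the 64x64 SI input size, and fusion by simple concatenation are all assumptions made for illustration; only the fine-tuned AlexNet on SFIs, a second CNN on SIs, and a shared fused feature layer come from the abstract.

```python
# Hedged sketch of the depth + inertial fusion pipeline (not the authors' code).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 27  # UTD-MHAD contains 27 action classes

# Branch 1: pre-trained AlexNet fine-tuned on Sequential Front view Images (SFI);
# the final 1000-way layer is replaced for the action classes.
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
alexnet.classifier[6] = nn.Linear(4096, NUM_CLASSES)

# Branch 2: a small CNN trained on Signal Images (SI) built from inertial data.
# The exact architecture is not given in the abstract; this one is a placeholder.
si_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(512),  # feature layer whose activations are extracted
)
alexnet.eval()
si_cnn.eval()

# Depth features: AlexNet's penultimate 4096-d activations (final layer dropped).
depth_extractor = nn.Sequential(
    alexnet.features, alexnet.avgpool, nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],
)

def extract_fused_features(sfi_batch, si_batch):
    """Extract learned features from both CNNs and concatenate them into a
    shared feature vector, which is then fed to an SVM or softmax classifier."""
    with torch.no_grad():
        depth_feat = depth_extractor(sfi_batch)   # [B, 4096]
        inertial_feat = si_cnn(si_batch)          # [B, 512]
    return torch.cat([depth_feat, inertial_feat], dim=1)

# Example: fused features for a batch of 4 samples
# (SFIs rendered as 3x224x224 RGB images; SIs assumed to be 1x64x64).
fused = extract_fused_features(torch.randn(4, 3, 224, 224),
                               torch.randn(4, 1, 64, 64))
print(fused.shape)  # torch.Size([4, 4608])
```

The fused vectors would then train the final classifier, e.g. an SVM via scikit-learn's `sklearn.svm.SVC` or a single softmax (linear) layer, matching the two classifier choices compared in the paper.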
