Paper Title


Three-Stream Fusion Network for First-Person Interaction Recognition

Paper Authors

Ye-Ji Kim, Dong-Gyu Lee, Seong-Whan Lee

Paper Abstract


First-person interaction recognition is a challenging task because of unstable video conditions resulting from the camera wearer's movement. For human interaction recognition from a first-person viewpoint, this paper proposes a three-stream fusion network with two main parts: a three-stream architecture and three-stream correlation fusion. The three-stream architecture captures the characteristics of the target appearance, target motion, and camera ego-motion. Meanwhile, the three-stream correlation fusion combines the feature map of each of the three streams to consider the correlations among the target appearance, target motion, and camera ego-motion. The fused feature vector is robust to camera movement and compensates for the noise of the camera ego-motion. Short-term intervals are modeled using the fused feature vector, and a long short-term memory (LSTM) model considers the temporal dynamics of the video. We evaluated the proposed method on two public benchmark datasets to validate the effectiveness of our approach. The experimental results show that the proposed fusion method successfully generated a discriminative feature vector, and our network outperformed all competing activity recognition methods in first-person videos where considerable camera ego-motion occurs.
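To illustrate the fusion idea described above, the following is a minimal sketch of combining three per-stream feature vectors while accounting for their pairwise correlations. The abstract does not specify the exact fusion operation, so this sketch assumes element-wise products as a simple proxy for cross-stream correlation; the function name `correlation_fusion` and the 256-dimensional feature size are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def correlation_fusion(appearance, motion, ego_motion):
    """Illustrative fusion of three stream features (not the paper's exact
    method): pairwise element-wise products stand in for the correlations
    among target appearance, target motion, and camera ego-motion."""
    corr_am = appearance * motion      # appearance-motion correlation term
    corr_ae = appearance * ego_motion  # appearance-ego-motion correlation term
    corr_me = motion * ego_motion      # motion-ego-motion correlation term
    # Concatenate the raw stream features with the correlation terms
    # into a single fused feature vector.
    return np.concatenate(
        [appearance, motion, ego_motion, corr_am, corr_ae, corr_me], axis=-1
    )

# Example: 256-dim features per stream yield a 1536-dim fused vector,
# which would then be fed per short-term interval into an LSTM.
rng = np.random.default_rng(0)
f_app, f_mot, f_ego = (rng.standard_normal(256) for _ in range(3))
fused = correlation_fusion(f_app, f_mot, f_ego)
print(fused.shape)  # (1536,)
```

In the full model, one such fused vector would be produced per short-term interval and the resulting sequence passed to the LSTM to capture the video's temporal dynamics.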
