Paper Title
A Deep Learning Framework for Recognizing both Static and Dynamic Gestures
Paper Authors
Abstract
Intuitive user interfaces are indispensable for interacting with human-centric smart environments. In this paper, we propose a unified framework that recognizes both static and dynamic gestures using simple RGB vision (without depth sensing). This feature makes it suitable for inexpensive human-robot interaction in social or industrial settings. We employ a pose-driven spatial attention strategy, which guides our proposed Static and Dynamic gestures Network (StaDNet). From the image of the human upper body, we estimate the person's depth along with the regions of interest around the hands. The Convolutional Neural Network in StaDNet is fine-tuned on a background-substituted hand gestures dataset. It is utilized to detect 10 static gestures for each hand as well as to obtain the hand image-embeddings. These are subsequently fused with the augmented pose vector and then passed to stacked Long Short-Term Memory blocks. Thus, human-centred frame-wise information from the augmented pose vector and from the left/right hand image-embeddings is aggregated in time to predict the dynamic gestures of the performing person. In a number of experiments, we show that the proposed approach surpasses state-of-the-art results on the large-scale ChaLearn 2016 dataset. Moreover, we transfer the knowledge learned through the proposed methodology to the Praxis gestures dataset, and the results obtained also surpass the state-of-the-art on this dataset.
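The per-frame fusion described in the abstract (pose vector concatenated with left/right hand embeddings, then aggregated in time by an LSTM) can be sketched as follows. This is a minimal illustration, not the paper's actual configuration: all dimensions are assumed, random vectors stand in for the CNN hand embeddings and pose features, and a single LSTM cell stands in for the stacked LSTM blocks.

```python
import numpy as np

# Hypothetical dimensions (illustrative assumptions, not from the paper)
POSE_DIM = 24      # augmented pose vector size
EMB_DIM = 64       # per-hand CNN image-embedding size
HID = 32           # LSTM hidden size
N_DYNAMIC = 5      # number of dynamic gesture classes

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; gate pre-activations stacked as [i, f, g, o]."""
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c + i * g          # update cell state
    h = o * np.tanh(c)         # emit hidden state
    return h, c

# Random parameters, for illustration only
IN_DIM = POSE_DIM + 2 * EMB_DIM
W = rng.standard_normal((4 * HID, IN_DIM)) * 0.1
U = rng.standard_normal((4 * HID, HID)) * 0.1
b = np.zeros(4 * HID)
W_out = rng.standard_normal((N_DYNAMIC, HID)) * 0.1

# Frame-wise fusion over a short clip: pose + left/right hand embeddings
T = 8
h, c = np.zeros(HID), np.zeros(HID)
for t in range(T):
    pose = rng.standard_normal(POSE_DIM)        # stand-in for pose vector
    left_emb = rng.standard_normal(EMB_DIM)     # stand-in for CNN embedding
    right_emb = rng.standard_normal(EMB_DIM)    # stand-in for CNN embedding
    x = np.concatenate([pose, left_emb, right_emb])
    h, c = lstm_step(x, h, c, W, U, b)

# Classify the dynamic gesture from the final hidden state (softmax)
logits = W_out @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In the paper's setup the static gestures are read out per hand from the CNN branch, while the temporal readout shown here corresponds to the dynamic-gesture prediction over the whole clip.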