Paper Title
View-Invariant Skeleton-based Action Recognition via Global-Local Contrastive Learning
Paper Authors
Paper Abstract
Skeleton-based human action recognition has been drawing increasing interest recently due to its low sensitivity to appearance changes and the growing accessibility of skeleton data. However, even 3D skeletons captured in practice remain sensitive to viewpoint and direction, given the occlusion of different human-body joints and the errors in human joint localization. Such view variance of skeleton data may significantly affect the performance of action recognition. To address this issue, we propose in this paper a new view-invariant representation learning approach, without any manual action labeling, for skeleton-based human action recognition. Specifically, we leverage the multi-view skeleton data simultaneously captured for the same person during network training by maximizing the mutual information between the representations extracted from different views, and we then propose a global-local contrastive loss to model the multi-scale co-occurrence relationships in both the spatial and temporal domains. Extensive experimental results show that the proposed method is robust to view differences in the input skeleton data and significantly boosts the performance of unsupervised skeleton-based human action recognition methods, achieving new state-of-the-art accuracies on two challenging multi-view benchmarks, PKUMMD and NTU RGB+D.
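To make the cross-view objective concrete, below is a minimal PyTorch sketch of a symmetric InfoNCE-style contrastive loss between embeddings of the same clip captured from two viewpoints, which is one standard way to maximize a lower bound on the mutual information between views. The function and variable names (cross_view_infonce, z_view_a, z_view_b) and the exact loss form are illustrative assumptions, not the paper's implementation; in particular, the paper's global-local loss additionally contrasts multi-scale features across spatial and temporal domains, which this sketch omits.

```python
# Hypothetical sketch of a cross-view contrastive (InfoNCE) loss; the paper's
# actual global-local loss is richer than this single-scale version.
import torch
import torch.nn.functional as F

def cross_view_infonce(z_view_a: torch.Tensor,
                       z_view_b: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """z_view_a, z_view_b: (batch, dim) embeddings of the same action clips
    seen from two camera views; row i of each tensor forms a positive pair,
    and all other rows in the batch serve as negatives."""
    z_a = F.normalize(z_view_a, dim=1)
    z_b = F.normalize(z_view_b, dim=1)
    # Pairwise cosine similarities between the two views' embeddings.
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric InfoNCE: each view's embedding must identify its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under this assumed formulation, minimizing the loss pulls together representations of the same action observed from different viewpoints while pushing apart representations of different actions, which is what drives the learned features toward view invariance.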