3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

Paper Title

3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

Authors

Yu Cheng, Bo Yang, Bo Wang, Robby T. Tan

Abstract

Estimating 3D poses from a monocular video is still a challenging task, despite the significant progress made in recent years. Generally, the performance of existing methods drops when the target person is too small or too large, or when the motion is too fast or too slow relative to the scale and speed of the training data. Moreover, to our knowledge, many of these methods are not explicitly designed or trained for severe occlusion, which compromises their performance in handling occlusion. To address these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear at different scales and move at various speeds, we apply multi-scale spatial features for 2D joint or keypoint prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structures as well as limb motions to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate various occlusion cases, from minor to severe, so that our network can learn better and become robust to various degrees of occlusion. As 3D ground-truth data are limited, we further utilize 2D video data to inject a semi-supervised learning capability into our network. Experiments on public datasets validate the effectiveness of our method, and our ablation studies show the strengths of our network's individual submodules.
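The explicit occlusion training mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_keypoints`, the `(x, y, confidence)` keypoint layout, the choice to zero out hidden keypoints, and the per-frame uniform random selection are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical keypoint layout: (x, y, confidence). Not taken from the paper.
Keypoint = Tuple[float, float, float]
MASKED: Keypoint = (0.0, 0.0, 0.0)  # assumed encoding of an occluded keypoint


def mask_keypoints(
    frames: List[List[Keypoint]],
    occlusion_ratio: float,
    rng: random.Random,
) -> List[List[Keypoint]]:
    """Return a copy of `frames` with a fraction of keypoints zeroed per frame.

    `occlusion_ratio` near 0 simulates minor occlusion; values near 1
    simulate severe occlusion. The input is left unmodified.
    """
    if not 0.0 <= occlusion_ratio <= 1.0:
        raise ValueError("occlusion_ratio must be in [0, 1]")

    masked_frames = []
    for frame in frames:
        # Floor of the requested fraction of joints in this frame.
        num_to_mask = int(occlusion_ratio * len(frame))
        # Pick distinct joint indices to hide in this frame.
        hidden = set(rng.sample(range(len(frame)), num_to_mask))
        masked_frames.append(
            [MASKED if j in hidden else kp for j, kp in enumerate(frame)]
        )
    return masked_frames


if __name__ == "__main__":
    # Two frames with four joints each (toy data, not from any dataset).
    clip = [
        [(10.0, 20.0, 0.9), (11.0, 25.0, 0.8), (12.0, 30.0, 0.7), (13.0, 35.0, 0.6)],
        [(10.5, 20.5, 0.9), (11.5, 25.5, 0.8), (12.5, 30.5, 0.7), (13.5, 35.5, 0.6)],
    ]
    for frame in mask_keypoints(clip, occlusion_ratio=0.5, rng=random.Random(0)):
        print(frame)
```

Varying `occlusion_ratio` across training batches would expose a model to a range of occlusion severities. Reusing the same hidden indices over consecutive frames would be one way to mimic an occluder that persists over time, though that variant is not shown here.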
