Paper Title
Semantic keypoint-based pose estimation from single RGB frames
Paper Authors
Paper Abstract
This paper presents an approach to estimating the continuous 6-DoF pose of an object from a single RGB image. The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model. Unlike prior work, we are agnostic to whether the object is textured or textureless, as the convnet learns the optimal representation from the available training-image data. Furthermore, the approach can be applied to both instance- and class-based pose recovery. Additionally, we accompany our main pipeline with a technique for semi-automatic data generation from unlabeled videos. This procedure allows us to train the learnable components of our method with minimal manual intervention in the labeling process. Empirically, we show that our approach accurately recovers the 6-DoF object pose in both instance- and class-based scenarios, even against cluttered backgrounds. We apply our approach both to several existing large-scale datasets, including PASCAL3D+, LineMOD-Occluded, YCB-Video, and TUD-Light, and, using our labeling pipeline, to a new dataset with novel object classes that we introduce here. Extensive empirical evaluations show that our approach provides pose estimation results comparable to the state of the art.
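To make the described pipeline concrete, below is a minimal sketch of the keypoint-then-geometry idea: a convnet predicts one heatmap per semantic keypoint, the heatmap peaks give 2D detections, and a pose is recovered by aligning the model's 3D keypoints to them. This sketch substitutes a rigid PnP solve (OpenCV's solvePnP) for the paper's deformable shape-model fitting, and the predictor function and variable names (predict_keypoint_heatmaps, model_points_3d) are illustrative assumptions, not the authors' actual API.

```python
# Minimal sketch: recover a 6-DoF pose from convnet keypoint heatmaps.
# NOTE: a rigid PnP solve stands in for the paper's deformable-model
# fitting; predict_keypoint_heatmaps and model_points_3d are assumed names.
import numpy as np
import cv2


def pose_from_keypoints(image, model_points_3d, camera_matrix,
                        predict_keypoint_heatmaps):
    """Estimate rotation and translation from a single RGB image.

    model_points_3d: (N, 3) semantic keypoint locations on the object model.
    camera_matrix:   (3, 3) intrinsic calibration matrix.
    predict_keypoint_heatmaps: convnet returning (N, H, W) heatmaps.
    """
    heatmaps = predict_keypoint_heatmaps(image)  # (N, H, W)

    # Take each heatmap's peak as the 2D keypoint estimate, in (x, y) order.
    points_2d = np.array(
        [np.unravel_index(np.argmax(h), h.shape)[::-1] for h in heatmaps],
        dtype=np.float64,
    )

    # Solve the perspective-n-point problem (EPnP needs >= 4 keypoints).
    ok, rvec, tvec = cv2.solvePnP(
        model_points_3d.astype(np.float64),
        points_2d,
        camera_matrix,
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        raise RuntimeError("PnP solve failed")

    rotation, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation matrix
    return rotation, tvec
```

A practical system would weight keypoints by heatmap confidence and, as in the paper, fit a deformable shape basis rather than a single rigid model, which is what enables class-level (not just instance-level) pose recovery.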