Paper Title

Uni6D: A Unified CNN Framework without Projection Breakdown for 6D Pose Estimation

Authors

Xiaoke Jiang, Donghai Li, Hao Chen, Ye Zheng, Rui Zhao, Liwei Wu

Abstract

As RGB-D sensors become more affordable, using RGB-D images to obtain high-accuracy 6D pose estimation results becomes a better option. State-of-the-art approaches typically use separate backbones to extract features from RGB and depth images: a 2D CNN for the RGB image, a per-pixel point cloud network for the depth data, and a fusion network to merge the two feature streams. We find that the essential reason for using two independent backbones is the "projection breakdown" problem. In the depth image plane, the projected 3D structure of the physical world is preserved by the 1D depth value together with its built-in 2D pixel coordinates (UV). Any spatial transformation that modifies UV, such as resize, flip, crop, or pooling operations in the CNN pipeline, breaks the binding between the pixel value and its UV coordinates. As a consequence, the 3D structure is no longer preserved by the modified depth image or feature map. To address this issue, we propose a simple yet effective method, denoted Uni6D, that explicitly takes the extra UV data along with the RGB-D image as input. Our method is a unified CNN framework for 6D pose estimation with a single CNN backbone. In particular, its architecture is based on Mask R-CNN with two extra heads: an RT head that directly predicts the 6D pose, and an auxiliary abc head that guides the network to map visible points to their coordinates in the 3D model. This end-to-end approach balances simplicity and accuracy, achieving accuracy comparable to the state of the art with 7.2× faster inference speed on the YCB-Video dataset.
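
To make the "projection breakdown" fix concrete, the sketch below is a minimal PyTorch illustration, not the authors' released code; the helper name build_uv_channels and the unnormalized pixel-coordinate encoding are assumptions. It shows how two UV channels can be concatenated with an RGB-D frame so that a crop in the CNN pipeline no longer severs the binding between each depth value and its original image coordinates.

```python
import torch

def build_uv_channels(height: int, width: int) -> torch.Tensor:
    """Hypothetical helper: build two channels holding each pixel's (u, v)
    image coordinates, to be stacked with RGB-D as extra network input."""
    v, u = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    return torch.stack([u, v], dim=0)  # shape: (2, H, W)

# A single RGB-D frame as a (4, H, W) tensor (3 RGB channels + 1 depth channel).
H, W = 480, 640
rgbd = torch.rand(4, H, W)

# Concatenate RGB-D with UV: the 6-channel input a single CNN backbone can consume.
net_input = torch.cat([rgbd, build_uv_channels(H, W)], dim=0)  # shape: (6, H, W)

# After a spatial transform such as a crop, every surviving pixel still carries
# its ORIGINAL (u, v) coordinates, so the depth-to-3D relationship is preserved.
crop = net_input[:, 100:300, 200:500]
print(crop[4:, 0, 0])  # top-left pixel of the crop -> tensor([200., 100.])
```

Because the coordinates travel with the pixels, a single 2D CNN backbone can consume the 6-channel tensor directly instead of routing the depth data through a separate point cloud network.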
