Paper Title

PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition

Paper Authors

Qijian Zhang, Junhui Hou, Yue Qian

Paper Abstract

As two fundamental representation modalities of 3D objects, 3D point clouds and multi-view 2D images record shape information in the distinct domains of geometric structure and visual appearance. In the current deep learning era, remarkable progress in processing these two data modalities has been achieved by customizing compatible 3D and 2D network architectures, respectively. However, unlike multi-view image-based 2D visual modeling paradigms, which have shown leading performance on several common 3D shape recognition benchmarks, point cloud-based 3D geometric modeling paradigms are still highly limited by insufficient learning capacity, owing to the difficulty of extracting discriminative features from irregular geometric signals. In this paper, we explore the possibility of boosting deep 3D point cloud encoders by transferring visual knowledge extracted from deep 2D image encoders under a standard teacher-student distillation workflow. Specifically, we propose PointMCD, a unified multi-view cross-modal distillation architecture comprising a pretrained deep image encoder as the teacher and a deep point encoder as the student. To perform heterogeneous feature alignment between the 2D visual and 3D geometric domains, we further introduce visibility-aware feature projection (VAFP), which aggregates point-wise embeddings into view-specific geometric descriptors. By pairwise aligning the multi-view visual and geometric descriptors, we obtain more powerful deep point encoders without exhaustive or complicated network modifications. Experiments on 3D shape classification, part segmentation, and unsupervised learning strongly validate the effectiveness of our method. The code and data will be publicly available at https://github.com/keeganhk/PointMCD.
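The two mechanisms named in the abstract — aggregating point-wise embeddings into view-specific geometric descriptors via visibility (VAFP), and pairwise aligning those descriptors with the teacher's multi-view visual descriptors — can be illustrated with a short PyTorch sketch. Everything below is an illustrative assumption rather than the authors' released code: the function names, tensor shapes, the masked average pooling, and the cosine-distance alignment loss are all plausible instantiations, not the paper's exact design.

```python
# Minimal sketch of multi-view cross-modal distillation with a
# visibility-aware projection step. Shapes and loss are assumptions.
import torch
import torch.nn.functional as F


def visibility_aware_feature_projection(point_feats, visibility):
    """Aggregate point-wise embeddings into view-specific descriptors.

    point_feats: (B, N, C) per-point embeddings from the student point encoder.
    visibility:  (B, V, N) binary mask; 1 if point n is visible in view v.
    Returns:     (B, V, C) one geometric descriptor per rendered view.

    Masked average pooling over each view's visible points is one plausible
    instantiation of the projection; the paper's exact aggregation may differ.
    """
    weights = visibility / visibility.sum(dim=-1, keepdim=True).clamp(min=1.0)
    return torch.einsum("bvn,bnc->bvc", weights, point_feats)


def cross_modal_distillation_loss(geo_desc, vis_desc):
    """Pairwise alignment of student geometric and teacher visual descriptors.

    geo_desc: (B, V, C) view-specific geometric descriptors (student).
    vis_desc: (B, V, C) visual descriptors from the frozen 2D teacher.
    Cosine distance is an assumed choice of alignment metric.
    """
    return (1.0 - F.cosine_similarity(geo_desc, vis_desc, dim=-1)).mean()


if __name__ == "__main__":
    B, N, V, C = 2, 1024, 6, 256
    point_feats = torch.randn(B, N, C, requires_grad=True)  # student output
    visibility = (torch.rand(B, V, N) > 0.5).float()        # per-view mask
    teacher_desc = torch.randn(B, V, C)                     # frozen teacher

    geo_desc = visibility_aware_feature_projection(point_feats, visibility)
    loss = cross_modal_distillation_loss(geo_desc, teacher_desc)
    loss.backward()  # gradients flow only into the student point encoder
    print(f"distillation loss: {loss.item():.4f}")
```

The point of masking by visibility is that each view-specific geometric descriptor then depends only on points the corresponding rendered image can actually see, which is what makes the heterogeneous feature alignment between the 2D and 3D branches well posed.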
