Paper Title


Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

Authors

Renrui Zhang, Ziyu Guo, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, Hongsheng Li, Peng Gao

Abstract


Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, how to exploit masked autoencoding for learning 3D representations of irregular point clouds remains an open question. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder that downsamples point tokens by stages, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. By multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes the reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy for linear SVM on ModelNet40, even surpassing some fully trained methods. By fine-tuning on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, +3.36% over the second-best, and the hierarchical pre-training scheme largely benefits few-shot classification, part segmentation, and 3D object detection. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.
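The abstract's key mechanism is the multi-scale masking strategy: masking is decided at one scale and the same visible regions are kept consistent across the pyramid. The sketch below illustrates the general idea only (it is not the paper's implementation): randomly mask tokens at a coarse scale, then mark a fine-scale point as visible iff its nearest coarse center is visible. The function name, the nearest-neighbor back-projection, and the use of NumPy are all illustrative assumptions.

```python
import numpy as np

def consistent_multiscale_mask(points, centers, mask_ratio=0.8, seed=0):
    """Hypothetical sketch of cross-scale-consistent masking.

    points  : (N, 3) fine-scale point coordinates
    centers : (M, 3) coarse-scale token centers (e.g. from FPS)
    Returns (center_visible, point_visible) boolean masks such that a
    fine point is visible iff its nearest coarse center is visible.
    """
    rng = np.random.default_rng(seed)
    n_centers = centers.shape[0]
    n_visible = max(1, int(round(n_centers * (1.0 - mask_ratio))))

    # Randomly choose which coarse tokens stay visible.
    visible_idx = rng.choice(n_centers, size=n_visible, replace=False)
    center_visible = np.zeros(n_centers, dtype=bool)
    center_visible[visible_idx] = True

    # Back-project visibility to the finer scale via nearest center,
    # so visible regions coincide geometrically across scales.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    nearest_center = dists.argmin(axis=1)
    point_visible = center_visible[nearest_center]
    return center_visible, point_visible
```

In an MAE-style pipeline, only the visible tokens at each scale would be fed to the encoder, and the decoder would reconstruct coordinates for the masked regions.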
