Paper Title
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
Paper Authors
Paper Abstract
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to foster coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ in open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. See the project website at https://dingry.github.io/projects/PLA.
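The abstract describes aligning 3D features with caption text embeddings via contrastive learning. Below is a minimal sketch (not the authors' released code) of such a point-caption contrastive objective: pooled 3D features for caption-paired regions are pulled toward their caption embeddings and pushed away from other captions in the batch. The function and variable names (`point_caption_contrastive_loss`, `point_feats`, `caption_embeds`, `temperature`) are illustrative assumptions, not identifiers from the paper.

```python
# Sketch of a symmetric InfoNCE-style loss aligning 3D region features
# with caption text embeddings, as one plausible instantiation of the
# contrastive learning step described in the abstract.
import torch
import torch.nn.functional as F


def point_caption_contrastive_loss(
    point_feats: torch.Tensor,      # (B, D) pooled 3D features, one per caption-paired region
    caption_embeds: torch.Tensor,   # (B, D) text embeddings of the paired captions
    temperature: float = 0.07,
) -> torch.Tensor:
    """Pull matched 3D-caption pairs together, push mismatched pairs apart."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(caption_embeds, dim=-1)
    logits = p @ t.t() / temperature            # (B, B) cosine-similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # The i-th region matches the i-th caption; all other pairs are negatives.
    loss_p2t = F.cross_entropy(logits, targets)
    loss_t2p = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_p2t + loss_t2p)


# Example usage with random features (D = 512, batch of 8 region-caption pairs).
if __name__ == "__main__":
    feats = torch.randn(8, 512)
    texts = torch.randn(8, 512)
    print(point_caption_contrastive_loss(feats, texts).item())
```

In the paper's setting, the text embeddings would come from a pre-trained vision-language model's text encoder, so that the learned 3D features land in a language-aware embedding space usable for open-vocabulary queries.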