Paper Title
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
Paper Authors
Paper Abstract
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Beyond the coarse semantic class prediction and bounding box regression of traditional 3D object detection, 3D dense captioning aims to produce a finer, instance-level label for each scene object of interest: a natural language description of its visual appearance and spatial relations. To detect and describe objects in a scene, following the spirit of neural machine translation, we propose a transformer-based encoder-decoder architecture, SpaCap3D, that transforms objects into descriptions. In particular, we investigate the relative spatiality of objects in 3D scenes and design a spatiality-guided encoder, via a token-to-token spatial relation learning objective, together with an object-centric decoder for precise, spatiality-enhanced caption generation. Evaluated on two benchmark datasets, ScanRefer and ReferIt3D, the proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in [email protected], respectively. Our project page with source code and supplementary files is available at https://SpaCap3D.github.io/.
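To make the encoder-decoder idea concrete, below is a minimal PyTorch sketch of a transformer that translates detected object tokens into caption words, in the spirit the abstract describes. All module names, dimensions, and the detector interface are illustrative assumptions, not the official SpaCap3D implementation; in particular, the sketch omits the spatiality-guided components (the token-to-token spatial relation objective and the object-centric decoder design) that distinguish SpaCap3D from a vanilla transformer.

```python
import torch
import torch.nn as nn

class EncoderDecoderCaptioner(nn.Module):
    """Minimal transformer encoder-decoder for object-to-caption
    translation. Sizes and names are illustrative assumptions,
    not the official SpaCap3D implementation."""

    def __init__(self, d_model=256, nhead=8, num_layers=4, vocab_size=4000):
        super().__init__()
        # Encoder: contextualizes detected object tokens (e.g., proposal
        # features from a 3D detector) with self-attention. SpaCap3D
        # additionally supervises these tokens with a token-to-token
        # spatial relation learning objective, omitted in this sketch.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Decoder: autoregressively generates caption words while
        # cross-attending to the encoded object tokens.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, object_feats, caption_tokens):
        # object_feats: (B, num_objects, d_model) object proposal features
        # caption_tokens: (B, seq_len) word indices (teacher forcing)
        memory = self.encoder(object_feats)
        tgt = self.token_embed(caption_tokens)
        # Causal mask: each word attends only to earlier words.
        seq_len = caption_tokens.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.word_head(out)  # (B, seq_len, vocab_size) logits
```

In a training step under this sketch, `model(object_feats, caption_tokens[:, :-1])` would produce next-word logits to be scored against `caption_tokens[:, 1:]` with cross-entropy, the standard teacher-forcing setup for caption generation.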