Paper Title
Sparse4D: Multi-view 3D Object Detection with Sparse Spatial-Temporal Fusion
Authors
Abstract
Bird's-eye-view (BEV) based methods have recently made great progress on the multi-view 3D detection task. Compared with BEV-based methods, sparse-based methods lag behind in performance, but still have many non-negligible merits. To push sparse 3D detection further, in this work we introduce a novel method, named Sparse4D, which iteratively refines anchor boxes by sparsely sampling and fusing spatial-temporal features. (1) Sparse 4D Sampling: for each 3D anchor, we assign multiple 4D keypoints, which are then projected onto multi-view/scale/timestamp image features to sample the corresponding features. (2) Hierarchy Feature Fusion: we hierarchically fuse the sampled features across different views/scales, different timestamps, and different keypoints to generate a high-quality instance feature. In this way, Sparse4D achieves 3D detection efficiently and effectively without relying on dense view transformation or global attention, and is friendlier to edge-device deployment. Furthermore, we introduce an instance-level depth reweight module to alleviate the ill-posed nature of 3D-to-2D projection. In experiments, our method outperforms all sparse-based methods and most BEV-based methods on the detection task of the nuScenes dataset.
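The core sampling step the abstract describes — projecting an anchor's keypoints into each camera image and gathering features there — can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (a single scale and timestamp, a 3×4 pinhole projection matrix per camera); the function names are ours, not from the released implementation, and the fusion here is a plain mean rather than the paper's learned hierarchical weighting.

```python
import numpy as np

def project_points(keypoints_3d, proj):
    """Project K 3D keypoints (K, 3) to pixel coordinates (K, 2) via a 3x4 matrix."""
    homo = np.concatenate([keypoints_3d, np.ones((len(keypoints_3d), 1))], axis=1)
    cam = homo @ proj.T                        # (K, 3) homogeneous image coords
    depth = np.clip(cam[:, 2:3], 1e-5, None)   # guard against points at/behind the camera
    return cam[:, :2] / depth

def bilinear_sample(feature_map, uv):
    """Bilinearly sample a (C, H, W) feature map at K pixel locations (K, 2) -> (C, K)."""
    C, H, W = feature_map.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    wu, wv = u - u0, v - v0
    top = feature_map[:, v0, u0] * (1 - wu) + feature_map[:, v0, u1] * wu
    bot = feature_map[:, v1, u0] * (1 - wu) + feature_map[:, v1, u1] * wu
    return top * (1 - wv) + bot * wv

def sample_instance_feature(keypoints_3d, feature_maps, proj_mats):
    """Sample every keypoint in every camera view, then mean-fuse into one (C,) vector."""
    per_view = []
    for fmap, proj in zip(feature_maps, proj_mats):
        uv = project_points(keypoints_3d, proj)
        per_view.append(bilinear_sample(fmap, uv))   # (C, K) per view
    return np.stack(per_view).mean(axis=(0, 2))      # naive fusion over views and keypoints
```

In the actual method, the per-view/scale/timestamp/keypoint features would be combined by the learned hierarchy fusion module instead of a uniform mean, and keypoints would additionally be warped across timestamps to account for ego and object motion.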