Paper Title
Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth
Paper Authors
Paper Abstract
Existing monocular depth estimation methods have achieved excellent robustness in diverse scenes, but they can only retrieve affine-invariant depth, up to an unknown scale and shift. However, in some video-based scenarios such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift residing in per-frame predictions may cause depth inconsistency across frames. To solve this problem, we propose a locally weighted linear regression method that recovers the scale and shift from very sparse anchor points, ensuring scale consistency along consecutive frames. Extensive experiments show that our method can boost the performance of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks. In addition, we merge over 6.3 million RGBD images to train strong and robust depth models. The resulting ResNet50-backbone model even outperforms the state-of-the-art DPT ViT-Large model. Combined with geometry-based reconstruction methods, we formulate a new dense 3D scene reconstruction pipeline that benefits from both the scale consistency of sparse points and the robustness of monocular methods. By performing simple per-frame prediction over a video, accurate 3D scene shape can be recovered.
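To make the core idea of the abstract concrete, below is a minimal NumPy sketch of aligning an affine-invariant depth prediction to sparse metric anchor points with locally weighted linear regression. The function name, the Gaussian distance kernel, and the `bandwidth` value are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def locally_weighted_scale_shift(pred_depth, anchor_uv, anchor_depth, bandwidth=0.2):
    """Align an affine-invariant depth map to sparse metric anchors.

    For each pixel, a (scale, shift) pair is fitted by weighted least squares,
    where each anchor's weight decays with its image-space distance to the
    pixel (Gaussian kernel). Sketch only; kernel and bandwidth are assumptions.

    pred_depth   : (H, W) affine-invariant depth prediction
    anchor_uv    : (N, 2) integer pixel coordinates (x, y) of sparse anchors
    anchor_depth : (N,) metric depth values at the anchors
    """
    h, w = pred_depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel() / w, ys.ravel() / h], axis=1)   # (H*W, 2), normalized
    anc = anchor_uv / np.array([w, h], dtype=float)            # (N, 2), normalized

    # Design matrix of the linear model s * d_pred + t at the anchor locations
    d_pred_at_anchor = pred_depth[anchor_uv[:, 1], anchor_uv[:, 0]]
    A = np.stack([d_pred_at_anchor, np.ones_like(d_pred_at_anchor)], axis=1)

    pred_flat = pred_depth.ravel()
    aligned = np.empty(h * w)
    for i, p in enumerate(pix):
        # Gaussian weights from the current pixel to every anchor
        wts = np.exp(-np.sum((anc - p) ** 2, axis=1) / (2 * bandwidth ** 2))
        sw = np.sqrt(wts)
        # Weighted least squares: minimize sum_j wts_j * (s*d_j + t - anchor_depth_j)^2
        s, t = np.linalg.lstsq(sw[:, None] * A, sw * anchor_depth, rcond=None)[0]
        aligned[i] = s * pred_flat[i] + t
    return aligned.reshape(h, w)
```

Because the scale and shift vary smoothly over the image rather than being a single global pair, per-frame predictions aligned to shared sparse points (e.g., from SfM) remain consistent across consecutive frames, which is the property the pipeline relies on.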