Paper Title
Graph Attention Network for Camera Relocalization on Dynamic Scenes
Authors
Abstract
We devise a graph attention network-based approach for learning a scene triangle mesh representation in order to estimate the camera position of an image in a dynamic environment. Previous approaches build a scene-dependent model that explicitly or implicitly embeds the structure of the scene, using convolutional neural networks or decision trees to establish 2D/3D-3D correspondences. Such a mapping overfits the target scene and does not generalize well to dynamic changes in the environment. Our work introduces a novel approach that solves the camera relocalization problem using the available triangle mesh. Our 3D-3D matching framework consists of three blocks: (1) a graph neural network to compute the embedding of mesh vertices, (2) a convolutional neural network to compute the embedding of grid cells defined on the RGB-D image, and (3) a neural network model to establish the correspondence between the two embeddings. These three components are trained end-to-end. To predict the final pose, we run the RANSAC algorithm to generate camera pose hypotheses, and we refine the prediction using the point-cloud representation. Our approach significantly improves the camera pose accuracy of the state-of-the-art method from $0.358$ to $0.506$ on the RIO10 benchmark for dynamic indoor camera relocalization.
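The correspondence step of the framework above can be illustrated with a minimal sketch. The abstract does not specify the matching model, so this is only an assumed nearest-neighbor matching in a shared embedding space: the `vertex_emb` and `cell_emb` arrays stand in for the outputs of the GNN (block 1) and CNN (block 2), and all names, dimensions, and the cosine-similarity matching rule are hypothetical.

```python
import numpy as np

def match_embeddings(vertex_emb, cell_emb):
    """Match each image grid cell to its nearest mesh vertex in embedding space.

    vertex_emb: (V, D) array -- stand-in for the GNN output, one row per mesh vertex.
    cell_emb:   (C, D) array -- stand-in for the CNN output, one row per RGB-D grid cell.
    Returns an index array of shape (C,): the matched vertex for each cell.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    v = vertex_emb / np.linalg.norm(vertex_emb, axis=1, keepdims=True)
    c = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    sim = c @ v.T              # (C, V) similarity matrix
    return sim.argmax(axis=1)  # nearest vertex per cell

# Toy example: 4 vertex embeddings, 3 cell embeddings that are
# noisy copies of vertices 2, 0, and 3.
rng = np.random.default_rng(0)
vertices = rng.normal(size=(4, 8))
cells = vertices[[2, 0, 3]] + 0.01 * rng.normal(size=(3, 8))
matches = match_embeddings(vertices, cells)
```

In the actual method, such 3D-3D correspondences would then feed the RANSAC stage, which samples correspondence subsets to generate camera pose hypotheses before point-cloud refinement.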