Paper Title
Do We Really Need Scene-specific Pose Encoders?
Paper Authors
Paper Abstract
Visual pose regression models estimate the camera pose from a query image with a single forward pass. Current models learn a pose encoding from the image using deep convolutional networks that are trained per scene. The resulting encoding is typically passed to a multi-layer perceptron to regress the pose. In this work, we propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead. To test our hypothesis, we take a shallow architecture of several fully connected layers and train it with pre-computed encodings from a generic image retrieval model. We find that these encodings are not only sufficient to regress the camera pose, but that, when fed to a branching fully connected architecture, the trained model achieves competitive results and in some cases even surpasses current \textit{state-of-the-art} pose regressors. Moreover, we show that for outdoor localization, the proposed architecture is, to date, the only pose regressor that consistently localizes to within 2 meters and 5 degrees.
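As a concrete illustration of the branching fully connected design described in the abstract, the following is a minimal PyTorch sketch: a shallow shared trunk over a pre-computed image-retrieval encoding, with separate branches regressing the 3-D position and a unit quaternion for orientation. The encoding dimension, hidden sizes, and layer counts here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchingPoseRegressor(nn.Module):
    """Shallow branching MLP regressing camera pose from a pre-computed image
    encoding (e.g., a global descriptor from a generic image retrieval model).
    Dimensions and depth are illustrative, not the paper's exact values."""

    def __init__(self, encoding_dim: int = 2048, hidden_dim: int = 256):
        super().__init__()
        # Shared trunk over the frozen, pre-computed encoding.
        self.trunk = nn.Sequential(
            nn.Linear(encoding_dim, hidden_dim),
            nn.ReLU(),
        )
        # Separate branches for position (x, y, z) and orientation (quaternion).
        self.position_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 3)
        )
        self.orientation_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 4)
        )

    def forward(self, encoding: torch.Tensor) -> torch.Tensor:
        shared = self.trunk(encoding)
        position = self.position_head(shared)
        # Normalize the quaternion branch so it represents a valid rotation.
        orientation = F.normalize(self.orientation_head(shared), dim=-1)
        return torch.cat([position, orientation], dim=-1)


# Usage: a batch of pre-computed retrieval encodings -> 7-DoF poses (x, y, z, q).
model = BranchingPoseRegressor()
encodings = torch.randn(8, 2048)  # stand-in for descriptors from a retrieval network
poses = model(encodings)          # shape: (8, 7)
```

Because the encodings are computed once by a generic retrieval network, only this small head needs to be trained per scene, which is the point of contrast with per-scene convolutional pose encoders.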