MatchFormer: Interleaving Attention in Transformers for Feature Matching

Paper Title

MatchFormer: Interleaving Attention in Transformers for Feature Matching

Authors

Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen

Abstract

Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer requires only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
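
The interleaving idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attention`, `downsample`, `interleaved_encoder`), the identity query/key/value projections, the pairwise-average downsampling, and the per-stage attention pattern are all assumptions chosen for illustration. It only shows how self-attention (within one image) and cross-attention (between the two images) might alternate inside each stage of a hierarchical encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_feats, context_feats):
    # Single-head scaled dot-product attention with a residual connection.
    # Assumption: no learned projections (identity Q/K/V), for illustration only.
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    return query_feats + softmax(scores) @ context_feats

def downsample(feats):
    # Hypothetical stand-in for a strided patch embedding:
    # halve the token count by averaging adjacent token pairs.
    n, d = feats.shape
    return feats[: n - n % 2].reshape(-1, 2, d).mean(axis=1)

def interleaved_encoder(feats_a, feats_b, pattern_per_stage):
    # pattern_per_stage: one sequence of "self"/"cross" labels per stage.
    # The pattern itself is an assumption, not taken from the paper.
    outputs = []
    for pattern in pattern_per_stage:
        for kind in pattern:
            if kind == "self":
                # Feature extraction: each image attends to itself.
                feats_a, feats_b = attention(feats_a, feats_a), attention(feats_b, feats_b)
            elif kind == "cross":
                # Feature matching: each image attends to the other one.
                feats_a, feats_b = attention(feats_a, feats_b), attention(feats_b, feats_a)
            else:
                raise ValueError(f"unknown attention kind: {kind}")
        outputs.append((feats_a, feats_b))  # multi-scale features of this stage
        feats_a, feats_b = downsample(feats_a), downsample(feats_b)
    return outputs

rng = np.random.default_rng(0)
image_a = rng.normal(size=(16, 8))  # 16 tokens, 8 channels
image_b = rng.normal(size=(12, 8))  # 12 tokens, 8 channels
stages = interleaved_encoder(image_a, image_b, [("self", "cross"), ("self", "cross")])
```

Each entry of `stages` holds the features of both images at one scale, so a downstream matcher would receive features that have already exchanged information across the image pair at every stage, rather than only in a separate matching decoder.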
