Epipolar Transformers

Paper Title

Epipolar Transformers

Authors

Yihui He, Rui Yan, Katerina Fragkiadaki, Shoou-I Yu

Abstract

A common approach to localizing 3D human joints in a synchronized and calibrated multi-view setup consists of two steps: (1) apply a 2D detector separately to each view to localize joints in 2D, and (2) perform robust triangulation on the 2D detections from each view to obtain the 3D joint locations. However, in step 1 the 2D detector must resolve challenging cases, such as occlusions and oblique viewing angles, purely in 2D without leveraging any 3D information, even though such cases could potentially be better resolved in 3D. Therefore, we propose the differentiable "epipolar transformer", which enables the 2D detector to leverage 3D-aware features to improve 2D pose estimation. The intuition is: given a 2D location p in the current view, we would like to first find its corresponding point p' in a neighboring view, and then combine the features at p' with the features at p, thus leading to a 3D-aware feature at p. Inspired by stereo matching, the epipolar transformer leverages epipolar constraints and feature matching to approximate the features at p'. Experiments on InterHand and Human3.6M show that our approach yields consistent improvements over the baselines. Specifically, when no external data is used, our Human3.6M model trained with a ResNet-50 backbone and an image size of 256 x 256 outperforms the state of the art by 4.23 mm and achieves an MPJPE of 26.9 mm.
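
The intuition in the abstract (look up candidates for p' along the epipolar line, weight them by feature similarity, and combine the result with the feature at p) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a known fundamental matrix F between the two views, dot-product similarity with a softmax, and fusion by simple addition; the function names (epipolar_line, sample_line, fuse) are hypothetical, and the neighbor-view features at the sampled points are assumed to have been looked up already.

```python
import math

def epipolar_line(F, p):
    """Return (a, b, c) such that a*x + b*y + c = 0 is the epipolar
    line in the neighboring view for pixel p = (x, y) in the current view."""
    ph = (p[0], p[1], 1.0)  # homogeneous coordinates of p
    return tuple(sum(F[r][c] * ph[c] for c in range(3)) for r in range(3))

def sample_line(line, width, num_samples):
    """Sample num_samples (>= 2) points on the line, evenly spaced in x
    across an image of the given width. Assumes the line is not vertical."""
    a, b, c = line
    if abs(b) < 1e-12:
        raise ValueError("vertical epipolar line; sample along y instead")
    xs = [i * (width - 1) / (num_samples - 1) for i in range(num_samples)]
    return [(x, -(a * x + c) / b) for x in xs]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(feat_p, candidate_feats):
    """Approximate the feature at p' as a similarity-weighted sum of the
    candidate features, then add it to the feature at p (assumed fusion)."""
    scores = [sum(u * v for u, v in zip(feat_p, f)) for f in candidate_feats]
    weights = softmax(scores)
    matched = [
        sum(w * f[d] for w, f in zip(weights, candidate_feats))
        for d in range(len(feat_p))
    ]
    fused = [u + v for u, v in zip(feat_p, matched)]
    return fused, weights

# Toy example: a rectified pair with purely horizontal camera translation,
# for which F is the skew-symmetric matrix of t = (1, 0, 0) and epipolar
# lines are horizontal (same row in both views).
F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
line = epipolar_line(F, (3.0, 2.0))
points = sample_line(line, width=5, num_samples=5)
# Hypothetical 2-channel features; the second candidate matches feat_p best.
fused, weights = fuse([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy setup every sampled point lies on the same image row as p, and the candidate whose feature is most similar to the feature at p receives the largest weight. Because every step is a differentiable operation, a scheme of this kind can in principle be trained end to end with the 2D detector, which is the property the abstract emphasizes.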
