Paper Title


RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

Authors

Cheng Chi, Fangyun Wei, Han Hu

Abstract


Existing object detection frameworks are usually built on a single format of object/part representation, i.e., anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While these different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine these representations in a single framework to make good use of each strength, due to the heterogeneous or non-grid feature extraction of different representations. This paper presents an attention-based decoder module, similar to that in the Transformer~\cite{vaswani2017attention}, to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of \emph{key} instances to strengthen the main \emph{query} representation features in the vanilla detectors. Novel techniques are proposed for efficient computation of the decoder module, including a \emph{key sampling} approach and a \emph{shared location embedding} approach. The proposed module is named \emph{bridging visual representations} (BVR). It can be applied in-place, and we demonstrate its broad effectiveness in bridging other representations into prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS, where improvements of about $1.5\sim3.0$ AP are achieved. In particular, we improve a state-of-the-art framework with a strong backbone by about $2.0$ AP, reaching $52.7$ AP on COCO test-dev. The resulting network is named RelationNet++. The code will be available at https://github.com/microsoft/RelationNet2.
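To make the bridging mechanism concrete, below is a minimal PyTorch sketch of a BVR-style decoder block. It is a sketch under assumptions, not the authors' implementation (see the linked repository for that): the class and method names (BVRAttention, cos_embed), the embedding hyperparameters, and the input shapes are all hypothetical, and the paper's key sampling step is assumed to have already reduced the auxiliary points to a manageable set before this module is called.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BVRAttention(nn.Module):
    # Query features from the main representation (e.g., anchors) attend to
    # key features from an auxiliary representation (e.g., corner/center
    # points). The attention logit is an appearance similarity plus a bias
    # derived from a cosine embedding of the relative key-query locations.
    def __init__(self, dim=256, num_heads=8, loc_dim=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.loc_dim = loc_dim
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # maps the location embedding to one scalar bias per head
        self.loc_mlp = nn.Sequential(
            nn.Linear(loc_dim, loc_dim), nn.ReLU(), nn.Linear(loc_dim, num_heads))

    def cos_embed(self, delta):
        # sine/cosine embedding of relative (dx, dy) offsets; a "shared
        # location embedding" would precompute this over a quantized grid
        # instead of evaluating it per query-key pair
        freqs = 1000.0 ** (4.0 * torch.arange(self.loc_dim // 4, device=delta.device) / self.loc_dim)
        args = delta.unsqueeze(-1) / freqs                               # (Q, K, 2, loc_dim//4)
        return torch.cat([args.sin(), args.cos()], dim=-1).flatten(-2)  # (Q, K, loc_dim)

    def forward(self, q_feat, q_pos, k_feat, k_pos):
        # q_feat: (Q, C) main-representation features, q_pos: (Q, 2) locations
        # k_feat: (K, C) sampled auxiliary key features, k_pos: (K, 2) locations
        Q, K = q_feat.size(0), k_feat.size(0)
        q = self.q_proj(q_feat).view(Q, self.num_heads, self.head_dim)
        k = self.k_proj(k_feat).view(K, self.num_heads, self.head_dim)
        v = self.v_proj(k_feat).view(K, self.num_heads, self.head_dim)
        sim = torch.einsum('qhd,khd->hqk', q, k) / self.head_dim ** 0.5  # appearance term
        delta = q_pos.unsqueeze(1) - k_pos.unsqueeze(0)                  # (Q, K, 2) offsets
        geo = self.loc_mlp(self.cos_embed(delta)).permute(2, 0, 1)       # geometry term, (H, Q, K)
        attn = F.softmax(sim + geo, dim=-1)
        out = torch.einsum('hqk,khd->qhd', attn, v).reshape(Q, -1)
        return q_feat + self.out_proj(out)  # strengthened query features

The residual form of the output is what lets the block run in-place: the detector's downstream classification and regression heads consume the strengthened features with unchanged shapes, so the module can be dropped into RetinaNet, Faster R-CNN, FCOS, or ATSS without altering the rest of the pipeline.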
