Paper Title
YORO -- Lightweight End to End Visual Grounding
Paper Authors
Paper Abstract
We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred to via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without a CNN backbone. YORO consumes natural language queries, image patches, and learnable detection tokens, and predicts the coordinates of the referred object using a single transformer encoder. To assist the alignment between text and visual objects, a novel patch-text alignment loss is proposed. Extensive experiments are conducted on 5 different datasets with ablations on architecture design choices. YORO is shown to support real-time inference and to outperform all approaches in this class (single-stage methods) by large margins. It is also the fastest VG model and achieves the best speed/accuracy trade-off in the literature.
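The abstract describes a single transformer encoder that jointly consumes text tokens, image patches, and learnable detection tokens, and regresses the referred object's box from the detection tokens. Below is a minimal PyTorch sketch of such an encoder-only forward pass. It is illustrative only, not the authors' implementation: the class name `EncoderOnlyVG`, the hyperparameters, and the single-box head are assumptions, and the proposed patch-text alignment loss is not detailed in the abstract, so it is omitted here.

```python
import torch
import torch.nn as nn

class EncoderOnlyVG(nn.Module):
    """Sketch of an encoder-only visual grounding model in the spirit of YORO."""

    def __init__(self, d_model=256, n_heads=8, n_layers=6,
                 vocab_size=30522, img_size=224, patch_size=16, n_det=1):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # Linear patch projection (no CNN backbone): one strided conv = patchify + embed.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Learnable detection token(s) that will carry the box prediction.
        self.det_tokens = nn.Parameter(torch.zeros(1, n_det, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized to [0, 1]

    def forward(self, token_ids, image):
        # token_ids: (B, L) word-piece ids; image: (B, 3, H, W)
        txt = self.text_embed(token_ids)                                   # (B, L, D)
        img = self.patch_embed(image).flatten(2).transpose(1, 2)           # (B, P, D)
        img = img + self.pos_embed
        det = self.det_tokens.expand(image.size(0), -1, -1)                # (B, n_det, D)
        # One joint sequence through a single encoder: text, patches, detection tokens.
        x = self.encoder(torch.cat([txt, img, det], dim=1))
        # Read the box(es) off the detection token positions.
        return self.box_head(x[:, -det.size(1):, :]).sigmoid()

# Usage with dummy inputs:
# boxes = EncoderOnlyVG()(torch.randint(0, 30522, (2, 12)), torch.randn(2, 3, 224, 224))
# boxes.shape -> (2, 1, 4)
```

Keeping everything in one encoder, rather than separate vision/text backbones plus a fusion decoder, is what lets this style of design trade a small amount of accuracy for substantially faster inference.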