SAViR-T: Spatially Attentive Visual Reasoning with Transformers

Paper Title

SAViR-T: Spatially Attentive Visual Reasoning with Transformers

Authors

Pritish Sahu, Kalliopi Basioti, Vladimir Pavlovic

Abstract

We present a novel computational model, "SAViR-T", for the family of visual reasoning problems embodied in Raven's Progressive Matrices (RPM). Our model considers the explicit spatial semantics of visual elements within each image in the puzzle, encoded as spatio-visual tokens, and learns both the intra-image and the inter-image token dependencies, which are highly relevant for the visual reasoning task. Token-wise relationships, modeled through the transformer-based SAViR-T architecture, extract group (row or column) driven representations by leveraging group-rule coherence, and use this as the inductive bias to extract the underlying rule representations in the top two rows (or columns) per token in the RPM. We use these relation representations to locate the correct choice image that completes the last row or column of the RPM. Extensive experiments across both synthetic RPM benchmarks, including RAVEN, I-RAVEN, RAVEN-FAIR, and PGM, and the natural image-based "V-PROM" demonstrate that SAViR-T sets a new state-of-the-art for visual reasoning, exceeding prior models' performance by a considerable margin.
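The final selection step described in the abstract (using rule representations from the top two rows to pick the choice image that completes the last row) can be illustrated with a toy sketch. This is not the SAViR-T implementation: the function names `cosine`, `mean_vector`, and `pick_choice` are hypothetical, the rule vectors are hand-made numbers rather than transformer outputs, and cosine similarity to the averaged rule vector is an assumed scoring rule chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pick_choice(row1_rule, row2_rule, candidate_rules):
    """Return the index of the candidate whose third-row rule vector
    best matches the rule shared by the first two rows.

    Assumption: each rule vector would come from some row encoder
    (hypothetical here); the scoring rule is plain cosine similarity.
    """
    shared = mean_vector([row1_rule, row2_rule])
    scores = [cosine(shared, candidate) for candidate in candidate_rules]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy, hand-made rule vectors (not produced by any model).
row1 = [1.0, 0.0, 0.5]
row2 = [0.9, 0.1, 0.6]
candidates = [
    [0.0, 1.0, 0.0],     # third row completed with choice 0
    [1.0, 0.05, 0.55],   # third row completed with choice 1
    [-1.0, 0.0, 0.2],    # third row completed with choice 2
]
best = pick_choice(row1, row2, candidates)
```

In this toy setup, `best` is the index of the candidate whose completed third row is most consistent with the rule inferred from the first two rows; a real system would compute the rule vectors per token with learned encoders.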
