Iterative Scene Graph Generation

Paper title

Iterative Scene Graph Generation

Authors

Siddhesh Khandelwal, Leonid Sigal

Abstract

The task of scene graph generation entails identifying object entities and their corresponding interaction predicates in a given image (or video). Because the solution space is combinatorially large, existing approaches to scene graph generation assume a certain factorization of the joint distribution to make the estimation feasible (e.g., assuming that objects are conditionally independent of predicate predictions). However, this fixed factorization is not ideal under all scenarios (e.g., for images where an object involved in an interaction is small and not discernible on its own). In this work, we propose a novel framework for scene graph generation that addresses this limitation and also introduces dynamic conditioning on the image, using message passing in a Markov Random Field. This is implemented as an iterative refinement procedure in which each modification is conditioned on the graph generated in the previous iteration. This conditioning across refinement steps allows joint reasoning over entities and relations. The framework is realized via a novel, end-to-end trainable transformer-based architecture. In addition, the proposed framework can improve the performance of existing approaches. Through extensive experiments on the Visual Genome and Action Genome benchmark datasets, we show improved performance on scene graph generation.
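The refinement idea in the abstract can be illustrated with a toy sketch. This is not the authors' transformer architecture: the `Triplet` class, the `refine` function, the label names, and every score and compatibility weight below are hypothetical, chosen only to show how conditioning each update on the previous iteration's graph lets an ambiguous object be resolved by the predicate it takes part in.

```python
from dataclasses import dataclass


@dataclass
class Triplet:
    """One (subject, predicate, object) edge; each field maps labels to probabilities."""
    subject: dict
    predicate: dict
    obj: dict


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


def refine(evidence, compat, steps=3):
    """Toy iterative refinement (hypothetical, not the paper's model).

    evidence: Triplet of per-component image scores.
    compat:   dict mapping (predicate, object) to a compatibility weight.
    Returns the list of graph estimates, one per iteration.
    """
    graph = Triplet(normalize(evidence.subject),
                    normalize(evidence.predicate),
                    normalize(evidence.obj))
    history = [graph]
    for _ in range(steps):
        prev = graph
        # Object update: image evidence times a message from the previous predicate estimate.
        obj = normalize({
            o: evidence.obj[o] * sum(prev.predicate[p] * compat[(p, o)] for p in prev.predicate)
            for o in evidence.obj
        })
        # Predicate update: image evidence times a message from the previous object estimate.
        predicate = normalize({
            p: evidence.predicate[p] * sum(prev.obj[o] * compat[(p, o)] for o in prev.obj)
            for p in evidence.predicate
        })
        graph = Triplet(prev.subject, predicate, obj)
        history.append(graph)
    return history


# Made-up example: the object is too small to classify on its own (0.5 / 0.5),
# but the predicate evidence favours "throwing".
evidence = Triplet(
    subject={"person": 1.0},
    predicate={"throwing": 0.8, "eating from": 0.2},
    obj={"frisbee": 0.5, "plate": 0.5},
)
compat = {
    ("throwing", "frisbee"): 0.9, ("throwing", "plate"): 0.1,
    ("eating from", "frisbee"): 0.1, ("eating from", "plate"): 0.9,
}
history = refine(evidence, compat, steps=3)
for step, g in enumerate(history):
    print(step, round(g.obj["frisbee"], 3), round(g.predicate["throwing"], 3))
```

In this toy, a fixed factorization that classified the object independently of the predicate would leave "frisbee" at 0.5. Conditioning on the previous graph moves it upward over the iterations.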
