论文标题
软管网络:场景图生成的高阶结构嵌入式网络
HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation
论文作者
论文摘要
场景图生成旨在为图像产生结构化表示,这需要了解对象之间的关系。由于深神经网络的连续性,场景图的预测被分为对象检测和关系分类。但是,独立关系类不能很好地分开视觉特征。尽管某些方法将视觉特征组织到图形结构中,并使用消息传递来学习上下文信息,但它们仍然遭受了巨大的类内变化和不平衡的数据分布的困扰。一个重要的因素是,他们学习了一个非结构化的输出空间,该空间忽略了场景图的固有结构。因此,在本文中,我们提出了一个嵌入式网络(软管网)来减轻此问题的高阶结构。首先,我们提出了一种新颖的结构感知到分类器(SEC)模块,以将关系的局部和全球结构信息纳入输出空间。具体而言,通过基于本地图的消息传递学习了一组上下文嵌入,然后映射到基于全局结构的分类空间。其次,由于学习过多的特定于上下文的分类子空间可能会遇到数据稀疏问题,因此我们提出了一个层次的语义聚合(HSA)模块,以通过引入高阶结构信息来减少子空间的数量。 HSA还是一个快速而灵活的工具,可以根据关系知识图自动搜索语义对象层次结构。广泛的实验表明,拟议的软管网络在两个流行的视觉基因组和VRD基准上实现了最新性能。
Scene graph generation aims to produce structured representations for images, which requires to understand the relations between objects. Due to the continuous nature of deep neural networks, the prediction of scene graphs is divided into object detection and relation classification. However, the independent relation classes cannot separate the visual features well. Although some methods organize the visual features into graph structures and use message passing to learn contextual information, they still suffer from drastic intra-class variations and unbalanced data distributions. One important factor is that they learn an unstructured output space that ignores the inherent structures of scene graphs. Accordingly, in this paper, we propose a Higher Order Structure Embedded Network (HOSE-Net) to mitigate this issue. First, we propose a novel structure-aware embedding-to-classifier(SEC) module to incorporate both local and global structural information of relationships into the output space. Specifically, a set of context embeddings are learned via local graph based message passing and then mapped to a global structure based classification space. Second, since learning too many context-specific classification subspaces can suffer from data sparsity issues, we propose a hierarchical semantic aggregation(HSA) module to reduces the number of subspaces by introducing higher order structural information. HSA is also a fast and flexible tool to automatically search a semantic object hierarchy based on relational knowledge graphs. Extensive experiments show that the proposed HOSE-Net achieves the state-of-the-art performance on two popular benchmarks of Visual Genome and VRD.