Paper Title
ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection
Paper Authors
Paper Abstract
We consider the problem of Human-Object Interaction (HOI) detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images. Most existing works treat HOIs as individual interaction categories and thus cannot handle the long-tail distribution of HOIs or the polysemy of action labels. We argue that multi-level consistencies among objects, actions, and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs. Leveraging the compositional and relational peculiarities of HOI labels, we propose ConsNet, a knowledge-aware framework that explicitly encodes the relations among objects, actions, and interactions into an undirected graph called the consistency graph, and exploits Graph Attention Networks (GATs) to propagate knowledge among HOI categories as well as their constituents. Our model takes the visual features of candidate human-object pairs and the word embeddings of HOI labels as inputs, maps them into a visual-semantic joint embedding space, and obtains detection results by measuring their similarities. We extensively evaluate our model on the challenging V-COCO and HICO-DET datasets, and the results validate that our approach outperforms state-of-the-art methods under both fully-supervised and zero-shot settings. Code is available at https://github.com/yeliudev/ConsNet.
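To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the two core ideas: GAT-style propagation of label embeddings over a consistency graph, and scoring candidate human-object pairs by cosine similarity in a visual-semantic joint space. All module names, dimensions (2048-d visual features, 300-d word embeddings, 512-d joint space), the single-head single-layer GAT, and the scoring scheme are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """A single graph attention layer in the style of Velickovic et al. (2018)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared node projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scoring function

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) 0/1 adjacency of the
        # consistency graph (assumed to include self-loops).
        h = self.W(x)                                     # (N, out_dim)
        N = h.size(0)
        # Attention logits e_ij = a([h_i || h_j]) for every node pair.
        e = self.a(torch.cat([h.unsqueeze(1).expand(N, N, -1),
                              h.unsqueeze(0).expand(N, N, -1)], dim=-1)).squeeze(-1)
        e = F.leaky_relu(e, 0.2)
        e = e.masked_fill(adj == 0, float('-inf'))        # attend only to graph neighbors
        alpha = torch.softmax(e, dim=-1)                  # normalized attention weights
        return alpha @ h                                  # aggregated node embeddings

class ConsNetSketch(nn.Module):
    """Hypothetical end-to-end sketch: graph-refined label embeddings vs. visual features."""
    def __init__(self, vis_dim=2048, word_dim=300, joint_dim=512):
        super().__init__()
        self.gat = GATLayer(word_dim, joint_dim)          # propagate knowledge over the graph
        self.vis_proj = nn.Linear(vis_dim, joint_dim)     # map visual features to joint space

    def forward(self, pair_feat, label_emb, adj):
        # pair_feat: (B, vis_dim) features of candidate human-object pairs
        # label_emb: (N, word_dim) word embeddings of object/action/HOI nodes
        nodes = F.normalize(self.gat(label_emb, adj), dim=-1)
        v = F.normalize(self.vis_proj(pair_feat), dim=-1)
        return v @ nodes.t()                              # (B, N) cosine-similarity HOI scores
```

Under this kind of scheme, an unseen HOI label can still receive a meaningful embedding because the attention layer aggregates information from its seen object and action neighbors in the consistency graph, which is what makes zero-shot matching in the joint space possible.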