Paper title
Rethinking the Evaluation of Unbiased Scene Graph Generation
Paper authors
Paper abstract
Current Scene Graph Generation (SGG) methods tend to predict frequent predicate categories and fail to recognize rare ones because the distribution of predicates is severely imbalanced. To improve the robustness of SGG models across predicate categories, recent research has focused on unbiased SGG and adopted mean Recall@K (mR@K) as the main evaluation metric. However, we identify two overlooked issues with this de facto standard metric, which make current unbiased SGG evaluation vulnerable and unfair: 1) mR@K neglects the correlations among predicates and unintentionally breaks category independence, because it ranks all triplet predictions together regardless of their predicate categories. 2) mR@K neglects the compositional diversity of different predicates and assigns excessively high weights to samples of some oversimplified categories that admit only a limited set of composable relation triplet types. In addition, we investigate the under-explored correlation between objects and predicates, which can serve as a simple but strong baseline for unbiased SGG. In this paper, we refine mR@K and propose two complementary evaluation metrics for unbiased SGG: Independent Mean Recall (IMR) and weighted IMR (wIMR). These two metrics are designed by considering the category independence and the diversity of composable relation triplets, respectively. We compare the proposed metrics with the de facto standard metrics through extensive experiments and discuss how to evaluate unbiased SGG in a more trustworthy way.
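To make the first issue concrete, here is a minimal sketch of how mR@K is commonly computed. It is an illustration under simplifying assumptions, not the paper's evaluation code: triplets are matched by exact label equality (real SGG evaluation also checks bounding-box overlap), and the function name `mean_recall_at_k` and the input layout are hypothetical. The proposed IMR and wIMR are not sketched, because the abstract does not give their definitions.

```python
from collections import defaultdict


def mean_recall_at_k(images, k):
    """Simplified mR@K (assumption: label-only matching, no box IoU).

    images: list of (gt_triplets, predictions) pairs, one per image.
      gt_triplets: list of (subject, predicate, object) tuples.
      predictions: list of (score, (subject, predicate, object)) tuples.
    """
    per_predicate = defaultdict(list)  # predicate -> per-image recall values
    for gt_triplets, predictions in images:
        # All predictions compete in ONE shared ranking, whatever their predicate.
        ranked = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
        top_k = {triplet for _, triplet in ranked}

        # Group the ground-truth triplets of this image by predicate.
        gt_by_predicate = defaultdict(set)
        for triplet in gt_triplets:
            gt_by_predicate[triplet[1]].add(triplet)

        # Recall of each predicate that occurs in this image.
        for predicate, gts in gt_by_predicate.items():
            per_predicate[predicate].append(len(gts & top_k) / len(gts))

    if not per_predicate:
        return 0.0
    # Average over images per predicate, then over predicates.
    class_recalls = [sum(v) / len(v) for v in per_predicate.values()]
    return sum(class_recalls) / len(class_recalls)


# Toy image (hypothetical data): "on" and "riding" describe the same pair.
example = [(
    [("man", "on", "horse"), ("man", "riding", "horse")],
    [(0.9, ("man", "on", "horse")), (0.8, ("man", "riding", "horse"))],
)]
print(mean_recall_at_k(example, k=1))
print(mean_recall_at_k(example, k=2))
```

In this toy case the correct "riding" prediction is pushed out of the top 1 by the equally correct "on" prediction, so the recall of "riding" depends on how another category is scored. This is the kind of cross-category interference the abstract refers to as breaking category independence.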