Paper Title
Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures
Paper Authors
Paper Abstract
We address the lack of reliability in benchmarking clustering techniques based on labeled datasets. A standard scheme in external clustering validation is to use class labels as ground-truth clusters, based on the assumption that each class forms a single, clearly separated cluster. However, as this cluster-label matching (CLM) assumption often breaks, the absence of a sanity check for the CLM of benchmark datasets casts doubt on the validity of external validations. Still, evaluating the degree of CLM is challenging. For example, internal clustering validation measures can quantify CLM within a single dataset to compare its different clusterings, but they are not designed to compare clusterings across different datasets. In this work, we propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets. We first determine four axioms for between-dataset internal measures, complementing Ackerman and Ben-David's within-dataset axioms. We then propose processes to generalize internal measures to fulfill these new axioms, and use them to extend the widely used Calinski-Harabasz index for between-dataset CLM evaluation. Through quantitative experiments, we (1) verify the validity and necessity of the generalization processes and (2) show that the proposed between-dataset Calinski-Harabasz index accurately evaluates CLM across datasets. Finally, we demonstrate the importance of evaluating the CLM of benchmark datasets before conducting external validation.
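For context, the standard (within-dataset) Calinski-Harabasz index scores a partition by the ratio of between-cluster to within-cluster dispersion, CH = (B / (k - 1)) / (W / (n - k)), where B and W are the between- and within-cluster sums of squares, k the number of clusters, and n the number of points. The sketch below is a minimal illustration of using this off-the-shelf index to score class labels as candidate clusters, via scikit-learn's calinski_harabasz_score; it is not the paper's between-dataset variant, and the choice of datasets and scaling is our own assumption. As the abstract notes, raw scores like these are not directly comparable across datasets, which is precisely the gap the paper addresses.

```python
# Minimal sketch: score class labels with the standard within-dataset
# Calinski-Harabasz index as a naive CLM check. Higher values mean the
# classes form denser, better-separated clusters *within* that dataset.
# Caveat: these raw scores are NOT comparable between datasets; the paper
# proposes a generalized between-dataset index for that purpose.
from sklearn.datasets import load_iris, load_wine  # illustrative datasets
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import calinski_harabasz_score

for name, loader in [("iris", load_iris), ("wine", load_wine)]:
    X, y = loader(return_X_y=True)
    X = StandardScaler().fit_transform(X)  # put features on a common scale
    score = calinski_harabasz_score(X, y)  # class labels as candidate clusters
    print(f"{name}: within-dataset CH index of class labels = {score:.1f}")
```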