Paper title

A penalized criterion for selecting the number of clusters for K-medians

Authors

Antoine Godichon-Baggioni, Sobihan Surendran

Abstract

Clustering is a common unsupervised machine learning technique that groups data points according to similar features. We focus here on unsupervised clustering of contaminated data, i.e., the setting where K-medians should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to treat the choice of the optimal number of clusters as the minimization of a penalized risk function. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared with other popular techniques in a simulation study. All studied algorithms are available in the R package Kmedians on CRAN.
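The idea of choosing the number of clusters by minimizing a penalized risk can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the K-medians variant (coordinate-wise medians with L1 distance and a deterministic farthest-point initialization), the helper names `k_medians` and `select_k`, the penalty `lam * sqrt(k / n)`, and the constant `lam`. It does not reproduce the penalty shape derived in the paper, nor the interface of the Kmedians R package.

```python
import math
from statistics import median


def l1(p, q):
    """Manhattan (L1) distance between two points given as tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add
    the point that is farthest from its nearest already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(l1(p, c) for c in centers)))
    return centers


def k_medians(points, k, n_iter=50):
    """Lloyd-style K-medians with coordinate-wise medians (an assumed,
    simplified variant). Returns the centers and the empirical risk,
    i.e. the mean L1 distance of each point to its nearest center."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(k), key=lambda i: l1(p, centers[i]))
            groups[j].append(p)
        # Keep the previous center if a group happens to be empty.
        new_centers = [
            tuple(median(coord) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    risk = sum(min(l1(p, c) for c in centers) for p in points) / len(points)
    return centers, risk


def select_k(points, k_values, lam):
    """Return the k minimizing risk(k) + pen(k), plus all criterion values.
    The penalty lam * sqrt(k / n) is a placeholder, not the paper's."""
    n = len(points)
    crit = {}
    for k in k_values:
        _, risk = k_medians(points, k)
        crit[k] = risk + lam * math.sqrt(k / n)
    return min(crit, key=crit.get), crit


# Toy data: three tight, well-separated groups of five points each.
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
true_centers = [(0, 0), (100, 0), (0, 100)]
points = [(cx + dx, cy + dy) for cx, cy in true_centers for dx, dy in offsets]

best_k, crit = select_k(points, range(1, 7), lam=20.0)
print(best_k)
```

On this toy data the criterion selects three clusters: going from two to three centers removes far more risk than the penalty adds, while any further center can reduce the risk by less than the penalty increase. The selected value depends on the constant `lam`, which is set by hand here.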
