Paper Title
On the Efficiency of K-Means Clustering: Evaluation, Optimization, and Algorithm Selection
Paper Authors
Paper Abstract
This paper presents a thorough evaluation of the existing methods that accelerate Lloyd's algorithm for fast k-means clustering. To do so, we analyze the pruning mechanisms of existing methods, and summarize their common pipeline into a unified evaluation framework UniK. UniK embraces a class of well-known methods and enables a fine-grained performance breakdown. Within UniK, we thoroughly evaluate the pros and cons of existing methods using multiple performance metrics on a number of datasets. Furthermore, we derive an optimized algorithm over UniK, which effectively hybridizes multiple existing methods for more aggressive pruning. To take this further, we investigate whether the most efficient method for a given clustering task can be automatically selected by machine learning, to benefit practitioners and researchers.
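To make the idea of pruning in Lloyd's algorithm concrete, the following is a minimal sketch and not the UniK framework or any method from the paper. It assumes a single triangle-inequality test of the kind used in accelerations such as Elkan's algorithm: if the distance between the current best centroid and another centroid is at least twice the distance from the point to the current best centroid, the other centroid cannot be closer and its distance need not be computed. The names `lloyd_with_pruning`, `dist`, and `skipped` are hypothetical and chosen for illustration only.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lloyd_with_pruning(points, k, iters=20, seed=0):
    """Lloyd's algorithm with one triangle-inequality pruning test.

    Returns (centroids, labels, skipped), where `skipped` counts the
    point-to-centroid distance computations avoided by pruning.
    Illustrative sketch only; initial centroids are sampled from the data.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    skipped = 0

    for _ in range(iters):
        # Inter-centroid distances, computed once per iteration.
        cc = [[dist(a, b) for b in centroids] for a in centroids]

        # Assignment step with pruning.
        changed = False
        for i, p in enumerate(points):
            best = labels[i]
            best_d = dist(p, centroids[best])
            for j in range(k):
                if j == best:
                    continue
                # If d(c_best, c_j) >= 2 * d(p, c_best), the triangle
                # inequality gives d(p, c_j) >= d(p, c_best): skip c_j.
                if cc[best][j] >= 2 * best_d:
                    skipped += 1
                    continue
                d = dist(p, centroids[j])
                if d < best_d:
                    best, best_d = j, d
            if best != labels[i]:
                labels[i] = best
                changed = True

        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]

        # Stop once no point changed its cluster.
        if not changed:
            break

    return centroids, labels, skipped


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels, skipped = lloyd_with_pruning(data, k=2)
    print(labels, skipped)
```

Because the test only skips centroids that cannot be strictly closer than the current best, the assignments match those of the unpruned loop; only the number of distance computations changes.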