论文标题
$ k $ -Means群集的快速噪声去除
Fast Noise Removal for $k$-Means Clustering
论文作者
论文摘要
本文认为$ k $ - 均值在噪音存在下聚类。众所周知,$ k $ - 均值聚类对噪声高度敏感,因此应删除噪声以获得质量解决方案。这个问题的流行表述称为$ k $ - 均与异常值聚类。 $ k $ -MEANS与异常值聚类的目标是将指定的数字$ z $作为噪声/异常值的$ z $,然后在其余数据上找到$ k $ -MEANS的解决方案。该问题引起了极大的关注,但是当前具有理论保证的算法遇到了较高的运行时间或固定质量固有损失。本文的主要贡献是两倍。首先,我们开发了一种简单的贪婪算法,该算法证明是最糟糕的案例保证。贪婪的算法添加了一个简单的预处理步骤,以消除噪声,可以将其与任何$ k $ - 均值聚类算法结合使用。该算法使第一个伪透明度抗性降低了$ k $ -MEANS的异常值,再到没有异常值的$ k $ -MEANS。其次,我们展示了如何构造尺寸$ o(k \ log n)$的核心。与贪婪算法结合使用时,我们将获得可扩展的线性时间算法。通过证明该算法会迅速消除噪声并获得高质量的聚类,从而对理论贡献进行了实验验证。
This paper considers $k$-means clustering in the presence of noise. It is known that $k$-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called $k$-means clustering with outliers. The goal of $k$-means clustering with outliers is to discard up to a specified number $z$ of points as noise/outliers and then find a $k$-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or inherent loss in the solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any $k$-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from $k$-means with outliers to $k$-means without outliers. Secondly, we show how to construct a coreset of size $O(k \log n)$. When combined with our greedy algorithm, we obtain a scalable, near linear time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.