Paper Title
Scalable Distributed Approximation of Internal Measures for Clustering Evaluation
Paper Authors
Paper Abstract
The most widely used internal measure for clustering evaluation is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, which is clearly infeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distance. Our algorithm hinges on a Probability Proportional to Size (PPS) sampling scheme, and, for any fixed $\varepsilon, \delta \in (0,1)$, it approximates the silhouette coefficient within a mere additive error $O(\varepsilon)$ with probability $1-\delta$, using a very small number of distance calculations. We also prove that the algorithm can be adapted to obtain rigorous approximations of other internal measures of clustering quality, such as cohesion and separation. Importantly, we provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, making our estimation approach applicable to big data scenarios. We perform an extensive experimental evaluation of our silhouette approximation algorithm, comparing its performance to a number of baseline heuristics on real and synthetic datasets. The experiments provide evidence that, unlike other heuristics, our estimation strategy not only provides tight theoretical guarantees but is also able to return highly accurate estimates while running in a fraction of the time required by the exact computation, and that its distributed implementation is highly scalable, thus enabling the computation of internal measures for very large datasets for which the exact computation is prohibitive.
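To make the computational gap concrete, the following is a minimal sketch (not the paper's algorithm) contrasting the exact $O(n^2)$ silhouette computation with a simple sampling-based estimate. The paper's method uses PPS sampling to obtain provable error bounds; here, purely for illustration, per-point mean distances are estimated from a uniform sample of up to `t` points per cluster, which carries no such guarantee. All function names and the parameter `t` are hypothetical.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette_exact(points, labels):
    """Exact silhouette coefficient: O(n^2) distance calculations."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a(i): mean distance to the other points of its own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to any other cluster
        b = min(sum(dist(p, q) for q in c) / len(c)
                for m, c in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def silhouette_sampled(points, labels, t=16, seed=0):
    """Heuristic approximation: estimate a(i) and b(i) from a uniform
    sample of up to t points per cluster. Illustrative only; the paper
    instead uses PPS sampling, which yields a provable additive error."""
    rng = random.Random(seed)
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    samples = {m: rng.sample(c, min(t, len(c))) for m, c in clusters.items()}
    scores = []
    for p, l in zip(points, labels):
        if len(clusters[l]) == 1:
            scores.append(0.0)
            continue
        s_own = samples[l]
        a = sum(dist(p, q) for q in s_own) / len(s_own)
        b = min(sum(dist(p, q) for q in s) / len(s)
                for m, s in samples.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

The exact routine touches every pair of points, while the sampled one needs only about `n * k * t` distances for `k` clusters, which is the kind of cost reduction the abstract refers to.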