Paper Title

Distributed k-Means with Outliers in General Metrics

Authors

Enrico Dandolo, Andrea Pietracaprina, Geppino Pucci

Abstract

Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is undoubtedly the k-means problem, which, given a set $P$ of points from a metric space and a parameter $k<|P|$, requires determining a subset $S$ of $k$ centers minimizing the sum of all squared distances of points in $P$ from their closest center. A more general formulation, known as k-means with $z$ outliers, introduced to deal with noisy datasets, features a further parameter $z$ and allows up to $z$ points of $P$ (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with $z$ outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term $O(\gamma)$ away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where $\gamma$ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.
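To make the coreset-based distributed scheme sketched in the abstract concrete, here is a minimal toy illustration (not the authors' algorithm): each partition is summarized into a small weighted coreset (Round 1), the coresets are merged, and a simple sequential k-means-with-outliers heuristic runs on the union (Round 2). The sampling-based summarization and the farthest-first solver below are placeholder choices; the paper's actual construction and its approximation guarantees are different.

```python
# Toy sketch of a coreset-based distributed k-means-with-outliers pipeline.
# Assumptions (not from the paper): random sampling for the local summary,
# weighted farthest-first traversal as the sequential solver on the coreset.
import random

def dist2(a, b):
    """Squared Euclidean distance between two point tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_coreset(points, size):
    """Round 1 (per reducer): pick `size` sampled representatives; each
    representative carries the count of points closest to it as its weight."""
    reps = random.sample(points, min(size, len(points)))
    weights = [0] * len(reps)
    for p in points:
        j = min(range(len(reps)), key=lambda i: dist2(reps[i], p))
        weights[j] += 1
    return list(zip(reps, weights))

def kmeans_outliers_on_coreset(coreset, k, z):
    """Round 2: on the merged weighted coreset, greedily pick k centers via
    farthest-first traversal, then ignore the z most expensive weighted
    representatives (a crude stand-in for handling outliers)."""
    pts = [p for p, _ in coreset]
    centers = [pts[0]]
    while len(centers) < k:
        far = max(pts, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(far)
    costs = sorted(w * min(dist2(p, c) for c in centers) for p, w in coreset)
    return centers, sum(costs[: max(0, len(costs) - z)])

if __name__ == "__main__":
    random.seed(0)
    # Three partitions of points around three well-separated cluster centers.
    partitions = [
        [(random.gauss(cx, 0.1), random.gauss(cy, 0.1)) for _ in range(50)]
        for cx, cy in [(0, 0), (5, 5), (0, 5)]
    ]
    merged = [c for part in partitions for c in local_coreset(part, 10)]
    centers, cost = kmeans_outliers_on_coreset(merged, k=3, z=2)
    print(centers, cost)
```

The point of the sketch is the structure, not the quality bounds: the local memory per reducer scales with the coreset size rather than the partition size, which is the tradeoff the abstract highlights.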
