Paper Title
SSDBCODI: Semi-Supervised Density-Based Clustering with Outliers Detection Integrated
Paper Authors
Paper Abstract
Clustering analysis is one of the critical tasks in machine learning. Traditionally, clustering has been an independent task, separate from outlier detection. Because outliers can significantly erode clustering performance, a small number of algorithms have tried to incorporate outlier detection into the clustering process. However, most of those algorithms are based on unsupervised partition-based methods such as k-means, which, by their nature, often fail to handle clusters of complex, non-convex shapes. To tackle this challenge, we propose SSDBCODI, a semi-supervised density-based algorithm. SSDBCODI combines the advantage of density-based algorithms, which are capable of dealing with clusters of complex shapes, with a semi-supervised element, which offers the flexibility to adjust clustering results based on a few user labels. We also merge an outlier detection component with the clustering process. Potential outliers are detected based on three scores generated during the process: (1) the reachability-score, which measures how density-reachable a point is from a labeled normal object; (2) the local-density-score, which measures the neighboring density of a data object; and (3) the similarity-score, which measures the closeness of a point to its nearest labeled outlier. In the following step, instance weights are generated for each data instance based on those three scores and then used to train a classifier for further clustering and outlier detection. For our evaluation, we have run the proposed algorithm against some state-of-the-art approaches on multiple datasets and reported the outlier detection results separately from the clustering results. Our results indicate that our algorithm can achieve superior performance with a small percentage of labels.
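The three per-point scores named in the abstract can be sketched as follows. This is a minimal illustrative approximation, not the authors' implementation: it uses plain Euclidean k-nearest-neighbor distances as a stand-in for the paper's density-based quantities, and the function name, inverse-distance formulas, and parameters are all assumptions.

```python
# Illustrative sketch (NOT the paper's implementation) of the three scores
# described in the SSDBCODI abstract. Euclidean k-NN distances stand in for
# the density-reachability machinery; all names and formulas are assumptions.
import numpy as np

def score_points(X, normal_idx, outlier_idx, k=3, eps=1e-12):
    """Return (reachability, local-density, similarity) scores per row of X."""
    # Full pairwise Euclidean distance matrix.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # (2) local-density-score: inverse of the distance to the k-th nearest
    # neighbor (column 0 of the sorted row is the self-distance, 0).
    kth = np.sort(d, axis=1)[:, k]
    local_density = 1.0 / (kth + eps)

    # (1) reachability-score proxy: inverse distance to the closest
    # labeled normal point.
    reachability = 1.0 / (d[:, normal_idx].min(axis=1) + eps)

    # (3) similarity-score: closeness to the nearest labeled outlier.
    similarity = 1.0 / (d[:, outlier_idx].min(axis=1) + eps)

    return reachability, local_density, similarity
```

Points in dense regions near labeled normals get high reachability and local-density scores, while points close to labeled outliers get high similarity scores; the abstract's next step would combine these into instance weights for training a classifier.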