Paper Title
Clustering High-dimensional Data via Feature Selection
Authors
Abstract
High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray and RNA-seq data. In this paper, we propose a new clustering procedure called Spectral Clustering with Feature Selection (SC-FS). We first obtain an initial estimate of the labels via spectral clustering, then select a small fraction of features with the largest R-squared with respect to these labels, i.e., the proportion of variation explained by the group labels, and finally cluster again using only the selected features. Under mild conditions, we prove that the proposed method identifies all informative features with high probability and achieves the minimax optimal clustering error rate for the sparse Gaussian mixture model. Applications of SC-FS to four real-world data sets demonstrate its usefulness in clustering high-dimensional data.
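The three steps named in the abstract (initial spectral clustering, R-squared feature screening, re-clustering on the selected features) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several simplifying assumptions: exactly two clusters, a spectral step that splits on the sign of the leading eigenvector of the centered Gram matrix (found by power iteration) rather than running k-means on several singular vectors, and a number of retained features `s` that is supplied by hand. The helper names `spectral_two_way`, `r_squared`, `sc_fs`, and `make_data` are hypothetical, and the simulated data exist only to exercise the code.

```python
import random


def center_columns(X):
    """Subtract each feature's mean so the Gram matrix reflects variation only."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def spectral_two_way(X, iters=200, seed=0):
    """Two-cluster spectral step (assumption: sign split of the leading eigenvector)."""
    rng = random.Random(seed)
    Xc = center_columns(X)
    n = len(Xc)
    # Gram matrix of the centered samples.
    G = [[sum(a * b for a, b in zip(Xc[i], Xc[k])) for k in range(n)] for i in range(n)]
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):  # power iteration for the leading eigenvector
        w = [sum(G[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    return [1 if t > 0 else 0 for t in v]


def r_squared(x, labels):
    """Proportion of the variation in one feature explained by the group labels."""
    mean = sum(x) / len(x)
    total = sum((t - mean) ** 2 for t in x)
    if total == 0:
        return 0.0
    within = 0.0
    for g in set(labels):
        grp = [t for t, lab in zip(x, labels) if lab == g]
        m = sum(grp) / len(grp)
        within += sum((t - m) ** 2 for t in grp)
    return 1.0 - within / total


def sc_fs(X, s):
    """Sketch of the pipeline: initial labels, keep top-s features by R-squared, re-cluster."""
    labels0 = spectral_two_way(X)
    p = len(X[0])
    scores = [r_squared([row[j] for row in X], labels0) for j in range(p)]
    selected = sorted(range(p), key=lambda j: scores[j], reverse=True)[:s]
    X_sel = [[row[j] for j in selected] for row in X]
    return spectral_two_way(X_sel), sorted(selected), scores


def make_data(n=40, p=200, s=10, mu=2.0, seed=1):
    """Toy sparse two-component Gaussian mixture: only the first s features carry signal."""
    rng = random.Random(seed)
    truth = [i % 2 for i in range(n)]
    X = []
    for i in range(n):
        sign = 1.0 if truth[i] == 1 else -1.0
        X.append([rng.gauss(sign * mu if j < s else 0.0, 1.0) for j in range(p)])
    return X, truth


X, truth = make_data()
labels, selected, scores = sc_fs(X, s=10)
agree = sum(a == b for a, b in zip(labels, truth)) / len(truth)
accuracy = max(agree, 1.0 - agree)  # cluster labels are only defined up to swapping
print("selected features:", selected)
print("clustering accuracy:", accuracy)
```

In this toy setting the informative features are indices 0 to 9, so the printed selection shows how many of them the screening step recovered. In practice `s` would have to be chosen by some data-driven rule, which this sketch does not attempt.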