Capturing the Denoising Effect of PCA via Compression Ratio

Paper title

Capturing the Denoising Effect of PCA via Compression Ratio

Authors

Mukherjee, Chandra Sekhar; Doerkar, Nikhil; Zhang, Jiapeng

Abstract

Principal component analysis (PCA) is one of the most fundamental tools in machine learning, widely used for both dimensionality reduction and denoising. In the latter setting, PCA is known to be effective at subspace recovery and has been proven to aid clustering algorithms in some specific settings, but the improvement it brings to noisy data is still not well quantified in general. In this paper, we propose a novel metric called the compression ratio to capture the effect of PCA on high-dimensional noisy data. We show that, for data with an underlying community structure, PCA significantly reduces the distance between data points belonging to the same community while reducing inter-community distances only relatively mildly. We explain this phenomenon through both theoretical proofs and experiments on real-world data. Building on this new metric, we design a straightforward algorithm that can be used to detect outliers. Roughly speaking, we argue that points whose compression ratios have lower variance do not share a common signal with other points, and hence can be considered outliers. We provide theoretical justification for this simple outlier detection algorithm and use simulations to demonstrate that our method is competitive with popular outlier detection tools. Finally, we run experiments on real-world high-dimensional noisy data (single-cell RNA-seq) to show that removing points from these datasets via our outlier detection method improves the accuracy of clustering algorithms. Our method is very competitive with popular outlier detection tools in this task.
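
The abstract does not state the formula for the compression ratio, so the following is only a rough sketch of the idea, not the authors' implementation. It assumes the compression ratio of a pair of points is their Euclidean distance before PCA divided by their distance after projection onto the top k principal components, and it scores each point by the variance of its ratios to all other points. The function names, the choice of k, the synthetic data, and the number of flagged points are all hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def pca_project(X, k):
    """Project the centered rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def compression_ratios(X, k):
    """Assumed definition: pre-PCA distance / post-PCA distance (NaN on the diagonal)."""
    pre = pairwise_distances(X)
    post = pairwise_distances(pca_project(X, k))
    ratio = np.full(pre.shape, np.nan)
    off_diag = ~np.eye(len(X), dtype=bool)
    ratio[off_diag] = pre[off_diag] / post[off_diag]
    return ratio


def lowest_variance_points(ratio, n_flag):
    """Indices of the n_flag points whose compression ratios vary the least."""
    variance = np.nanvar(ratio, axis=1)
    return np.argsort(variance)[:n_flag], variance


# Hypothetical synthetic data: two noisy communities plus pure-noise points.
rng = np.random.default_rng(0)
dim, per_community, n_noise = 200, 50, 10
centers = rng.normal(scale=1.0, size=(2, dim))
communities = np.vstack(
    [c + rng.normal(size=(per_community, dim)) for c in centers]
)
noise_points = rng.normal(size=(n_noise, dim))
X = np.vstack([communities, noise_points])

ratio = compression_ratios(X, k=2)
flagged, variance = lowest_variance_points(ratio, n_flag=n_noise)
print("flagged indices:", np.sort(flagged))
```

Because an orthogonal projection never increases distances, every ratio under this assumed definition is at least 1. A pair that PCA pulls close together gets a large ratio, while a pair that stays roughly as far apart gets a ratio near 1. How well the flagged set matches the planted noise points depends on the signal strength, on k, and on the number of points flagged, so the output should be inspected rather than taken for granted.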
