Title
How to reduce dimension with PCA and random projections?
Authors
Abstract
In our "big data" age, the size and complexity of data are steadily increasing. Methods for dimension reduction are ever more popular and useful. Two distinct types of dimension reduction are "data-oblivious" methods such as random projections and sketching, and "data-aware" methods such as principal component analysis (PCA). Both have their strengths, such as speed for random projections and data-adaptivity for PCA. In this work, we study how to combine them to get the best of both. We study "sketch and solve" methods that take a random projection (or sketch) first, and compute PCA after. We compute the performance of several popular sketching methods (random iid projections, random sampling, subsampled Hadamard transform, count sketch, etc.) in a general "signal-plus-noise" (or spiked) data model. Compared to well-known works, our results (1) give asymptotically exact results, and (2) apply when the signal components are only slightly above the noise, but the projection dimension is non-negligible. We also study stronger signals, allowing more general covariance structures. We find that (a) signal strength decreases under projection in a delicate way depending on the structure of the data and the sketching method, (b) orthogonal projections are more accurate, (c) randomization does not hurt too much, due to concentration of measure, and (d) count sketch can be improved by a normalization method. Our results have implications for statistical learning and data analysis. We also illustrate that the results are highly accurate in simulations and in analyzing empirical data.
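The "sketch and solve" pipeline described above can be illustrated with a minimal sketch (the specific dimensions, the spike strength, and the choice of an iid Gaussian sketching matrix are illustrative assumptions, not the paper's exact setup): generate spiked signal-plus-noise data, project the samples down with a random sketching matrix, then run PCA on the much smaller sketched matrix and compare the recovered direction to the true spike.

```python
import numpy as np

rng = np.random.default_rng(0)

# Spiked ("signal-plus-noise") data model: X = signal + noise,
# where the signal is rank one with direction v. Dimensions are
# illustrative: n samples, p features, sketch size r.
n, p, r = 2000, 50, 200
u = rng.standard_normal((n, 1)) / np.sqrt(n)          # unit-norm left factor
v = rng.standard_normal(p)
v /= np.linalg.norm(v)                                 # true spike direction
X = 3.0 * np.sqrt(n) * u @ v[None, :] + rng.standard_normal((n, p))

# Sketch-and-solve: reduce the n samples to r rows with an iid
# Gaussian sketching matrix S (one of several sketches one could use),
# then compute PCA on the r x p sketched matrix S @ X.
S = rng.standard_normal((r, n)) / np.sqrt(r)
_, _, Vt = np.linalg.svd(S @ X, full_matrices=False)
v_hat = Vt[0]                                          # top sketched PC

# Alignment with the true spike direction (up to sign); values near 1
# mean the sketched PCA recovered the signal well.
alignment = abs(v_hat @ v)
print(f"alignment with true PC: {alignment:.3f}")
```

Swapping `S` for a row-sampling or count-sketch matrix changes only the sketching step; per the abstract, the attainable alignment then depends on both the data structure and the sketch used.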