高维中的子空间聚类：相变和统计到计算间隙

论文标题

高维中的子空间聚类：相变和统计到计算间隙

Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap

论文作者

Pesce, Luca, Loureiro, Bruno, Krzakala, Florent, Zdeborová, Lenka

论文摘要

一个简单的研究子空间聚类的模型是高维$ k $ -GAUSSIAN混合模型，其中群集均值为稀疏的向量。在这里，我们在该模型中，在具有广泛稀疏性的高维度中，在该模型中提供了统计上最佳重建误差的精确表征，即，当群集的非零组件的分数表示$ρ$，以及样品数量和固定型固定的比率$ $ $ $ $ lime $α$，同时固定了dimensive diver。我们在下面确定信息理论阈值，在统计上，获得与真聚类均值的正相关是不可能的。此外，我们研究了通过其状态进化分析的近似消息传递（AMP）算法的性能，该算法在此任务中猜想在多项式算法之间是最佳的。我们尤其要确定算法之间存在统计与计算的差距，该算法需要一个信噪比$λ_ {\ text {alg}} \ ge k / \ ge k / \sqrtα$要比随机执行更好的表现，并且信息是$λ_ { ρ\logρ} / \sqrtα$。最后，我们通过将放大器的性能与其他稀疏增强算法（例如稀疏PCA和对角线阈值）进行比较，讨论了超扩张稀疏$ρ$的情况。

A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $ρ$, as well as the ratio $α$ between the number of samples and the dimension are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial algorithm for this task. We identify in particular the existence of a statistical-to-computational gap between the algorithm that require a signal-to-noise ratio $λ_{\text{alg}} \ge k / \sqrtα $ to perform better than random, and the information theoretic threshold at $λ_{\text{it}} \approx \sqrt{-k ρ\logρ} / \sqrtα$. Finally, we discuss the case of sub-extensive sparsity $ρ$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.

下载PDF全文

下载文献需遵守相关版权规定

论文标题