Paper Title
Bayesian Sparse Gaussian Mixture Model in High Dimensions
Paper Authors
Paper Abstract
We study the sparse high-dimensional Gaussian mixture model when the number of clusters is allowed to grow with the sample size. A minimax lower bound for parameter estimation is established, and we show that a constrained maximum likelihood estimator achieves the minimax lower bound. However, this optimization-based estimator is computationally intractable because the objective function is highly nonconvex and the feasible set involves discrete structures. To address the computational challenge, we propose a Bayesian approach to estimate high-dimensional Gaussian mixtures whose cluster centers exhibit sparsity using a continuous spike-and-slab prior. Posterior inference can be efficiently computed using an easy-to-implement Gibbs sampler. We further prove that the posterior contraction rate of the proposed Bayesian method is minimax optimal. The mis-clustering rate is obtained as a by-product using tools from matrix perturbation theory. The proposed Bayesian sparse Gaussian mixture model does not require pre-specifying the number of clusters, which can be adaptively estimated via the Gibbs sampler. The validity and usefulness of the proposed method are demonstrated through simulation studies and the analysis of a real-world single-cell RNA sequencing dataset.
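To make the idea of a continuous spike-and-slab prior concrete, the following is a minimal sketch of drawing one sparse cluster center from such a prior. It is an illustration only and not the paper's exact specification: the two-Gaussian form of the prior, the function name `sample_spike_and_slab`, and the parameters `theta`, `spike_scale`, and `slab_scale` are all assumptions introduced here.

```python
import numpy as np


def sample_spike_and_slab(p, theta=0.1, spike_scale=0.01, slab_scale=2.0, seed=0):
    """Draw one p-dimensional cluster center from a continuous spike-and-slab prior.

    Assumed (hypothetical) form: each coordinate comes from a wide "slab"
    Gaussian with probability theta, and from a narrow "spike" Gaussian
    concentrated near zero otherwise.
    """
    rng = np.random.default_rng(seed)
    # Latent indicator: True means the coordinate is drawn from the slab.
    is_slab = rng.random(p) < theta
    # Per-coordinate standard deviation chosen by the indicator.
    scale = np.where(is_slab, slab_scale, spike_scale)
    # Both components are continuous, so no coordinate is exactly zero.
    center = rng.normal(loc=0.0, scale=scale)
    return center, is_slab


center, is_slab = sample_spike_and_slab(p=1000)
print("slab coordinates:", int(is_slab.sum()))
print("largest |spike| coordinate:", float(np.abs(center[~is_slab]).max()))
```

Because the spike is a narrow continuous distribution rather than a point mass at zero, the draws are only approximately sparse: most coordinates are near zero rather than exactly zero. In this generic form, both components share a conjugate Gaussian shape, which is one common reason continuous spike-and-slab priors are convenient for Gibbs sampling.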