Handling missing data in model-based clustering

Paper title

Handling missing data in model-based clustering

Authors

Alessio Serafini, Thomas Brendan Murphy, Luca Scrucca

Abstract

Gaussian mixture models (GMMs) are a powerful tool for clustering, classification and density estimation when clustering structures are embedded in the data. The presence of missing values can substantially affect the GMM estimation process, so handling missing data is a crucial point in clustering, classification and density estimation. Several techniques have been developed to impute the missing values before model estimation. Among these, multiple imputation is a simple and useful general approach to handling missing data. In this paper we propose two different methods to fit Gaussian mixtures in the presence of missing data. Both methods use a variant of the Monte Carlo Expectation-Maximisation (MCEM) algorithm for data augmentation. Thus, multiple imputations are performed during the E-step, followed by the standard M-step for a given eigen-decomposed component-covariance matrix. We show that the proposed methods outperform the multiple imputation approach, both in terms of cluster identification and density estimation.
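
The abstract does not spell out how imputations are drawn inside the E-step. As a rough illustration only, and not the authors' implementation, the sketch below shows the standard building block such a step could rely on: drawing the missing coordinates of one observation from a single Gaussian component, conditional on the observed coordinates. The function name `impute_conditional_gaussian` and its parameters are hypothetical, and the sketch assumes missing values are coded as NaN and the component mean and covariance are already known. A full mixture version would additionally weight or sample components and update the parameters in an M-step.

```python
import numpy as np


def impute_conditional_gaussian(x, mean, cov, n_draws, rng):
    """Draw n_draws completions of x from N(mean, cov) given its observed entries.

    Hypothetical helper for illustration: NaN marks a missing entry, and
    mean/cov describe one Gaussian component that is assumed to be known.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    miss = np.isnan(x)
    obs = ~miss

    # Start from n_draws copies of x; observed entries are never modified.
    draws = np.tile(x, (n_draws, 1))
    if not miss.any():
        return draws
    if not obs.any():
        # Nothing observed: sample from the unconditional component.
        draws[:, miss] = rng.multivariate_normal(
            mean[miss], cov[np.ix_(miss, miss)], size=n_draws
        )
        return draws

    cov_oo = cov[np.ix_(obs, obs)]
    cov_mo = cov[np.ix_(miss, obs)]
    cov_mm = cov[np.ix_(miss, miss)]

    # gain = cov_mo @ inv(cov_oo), computed with a solve (cov_oo is symmetric).
    gain = np.linalg.solve(cov_oo, cov_mo.T).T
    cond_mean = mean[miss] + gain @ (x[obs] - mean[obs])
    cond_cov = cov_mm - gain @ cov_mo.T
    cond_cov = (cond_cov + cond_cov.T) / 2  # remove numerical asymmetry

    draws[:, miss] = rng.multivariate_normal(cond_mean, cond_cov, size=n_draws)
    return draws


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mean = np.array([0.0, 0.0])
    cov = np.array([[1.0, 0.9], [0.9, 1.0]])
    completed = impute_conditional_gaussian([2.0, np.nan], mean, cov, 5, rng)
    print(completed)
```

With strongly correlated coordinates, as in the usage example, the imputed values concentrate around the conditional mean rather than the marginal mean, which is why imputing within a fitted component can preserve cluster structure better than imputing before model estimation.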
