Paper Title

Estimation of Classification Rules from Partially Classified Data

Authors

Geoffrey J. McLachlan, Daniel Ahfock

Abstract

We consider the situation where the observed sample contains some observations whose class of origin is known (that is, they are classified with respect to the g underlying classes of interest), and where the remaining observations in the sample are unclassified (that is, their class labels are unknown). For class-conditional distributions taken to be known up to a vector of unknown parameters, the aim is to estimate the Bayes' rule of allocation for the allocation of subsequent unclassified observations. Estimation on the basis of both the classified and unclassified data can be undertaken in a straightforward manner by fitting a g-component mixture model by maximum likelihood (ML) via the EM algorithm in the situation where the observed data can be assumed to be an observed random sample from the adopted mixture distribution. This assumption applies if the missing-data mechanism is ignorable in the terminology pioneered by Rubin (1976). An initial likelihood approach was to use the so-called classification ML approach whereby the missing labels are taken to be parameters to be estimated along with the parameters of the class-conditional distributions. However, as it can lead to inconsistent estimates, the focus of attention switched to the mixture ML approach after the appearance of the EM algorithm (Dempster et al., 1977). Particular attention is given here to the asymptotic relative efficiency (ARE) of the Bayes' rule estimated from a partially classified sample. Lastly, we consider briefly some recent results in situations where the missing label pattern is non-ignorable for the purposes of ML estimation for the mixture model.
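The mixture ML approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes univariate normal class-conditional densities with a common variance, an ignorable missing-label mechanism, and at least two distinct classified observations overall, with every class represented among them (used for the starting values). The function names `fit_partially_classified_mixture` and `bayes_rule` are hypothetical and chosen only for illustration.

```python
import math

def fit_partially_classified_mixture(x, labels, g, n_iter=200):
    """EM fit of a univariate g-component normal mixture with a common variance.

    labels[j] is the known class (0..g-1) of x[j], or None if x[j] is unclassified.
    Assumption: the missing-label mechanism is ignorable and every class has
    at least one classified observation.
    """
    n = len(x)
    n_known = sum(lab is not None for lab in labels)
    # Starting values computed from the classified observations only.
    pi, mu = [], []
    for i in range(g):
        xi = [xj for xj, lab in zip(x, labels) if lab == i]
        pi.append(len(xi) / n_known)
        mu.append(sum(xi) / len(xi))
    var = sum((xj - mu[lab]) ** 2
              for xj, lab in zip(x, labels) if lab is not None) / n_known

    for _ in range(n_iter):
        # E-step: posterior class probabilities for unclassified observations;
        # classified observations keep their known 0/1 indicators.
        tau = []
        for xj, lab in zip(x, labels):
            if lab is not None:
                tau.append([1.0 if i == lab else 0.0 for i in range(g)])
            else:
                # The normalising constant cancels because the variance is common.
                logw = [math.log(pi[i]) - 0.5 * (xj - mu[i]) ** 2 / var
                        for i in range(g)]
                m = max(logw)
                w = [math.exp(v - m) for v in logw]
                s = sum(w)
                tau.append([v / s for v in w])
        # M-step: weighted updates of mixing proportions, means, common variance.
        n_i = [sum(t[i] for t in tau) for i in range(g)]
        pi = [n_i[i] / n for i in range(g)]
        mu = [sum(t[i] * xj for t, xj in zip(tau, x)) / n_i[i] for i in range(g)]
        var = sum(t[i] * (xj - mu[i]) ** 2
                  for t, xj in zip(tau, x) for i in range(g)) / n
    return pi, mu, var

def bayes_rule(x_new, pi, mu, var):
    """Allocate x_new to the class with the largest estimated posterior probability."""
    scores = [math.log(pi[i]) - 0.5 * (x_new - mu[i]) ** 2 / var
              for i in range(len(pi))]
    return scores.index(max(scores))
```

Classified observations contribute fixed indicators in the E-step, while unclassified ones contribute their current posterior probabilities; this is what distinguishes the mixture ML fit from the classification ML approach, which would replace those probabilities with hard label assignments.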
