Paper Title
Pursuing Sources of Heterogeneity in Modeling Clustered Population
Paper Authors
Paper Abstract
Researchers often have to deal with heterogeneous populations with mixed regression relationships, increasingly so in the era of data explosion. In such problems, when there are many candidate predictors, it is of interest not only to identify the predictors that are associated with the outcome, but also to distinguish the true sources of heterogeneity, i.e., to identify the predictors that have different effects among the clusters and thus are the true contributors to the formation of the clusters. We clarify the concept of a source of heterogeneity, accounting for potential scale differences among the clusters, and propose a regularized finite mixture effects regression to achieve heterogeneity pursuit and feature selection simultaneously. As the name suggests, the problem is formulated under an effects-model parameterization, in which the cluster labels are missing and the effect of each predictor on the outcome is decomposed into a common effect term and a set of cluster-specific terms. A constrained sparse estimation of these effects leads to the identification of both the variables with common effects and those with heterogeneous effects. We propose an efficient algorithm and show that our approach can achieve both estimation and selection consistency. Simulation studies further demonstrate the effectiveness of our method under various practical scenarios. Three applications are presented, namely, an imaging genetics study for linking genetic factors and brain neuroimaging traits in Alzheimer's disease, a public health study for exploring the association between suicide risk among adolescents and their school district characteristics, and a sports analytics study for understanding how the salary levels of baseball players are associated with their performance and contractual status.
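The effects-model decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: it takes an already-known matrix of cluster-specific coefficients and splits each predictor's effect into a common term plus cluster-specific deviations. The sum-to-zero constraint on the deviations, the function name `decompose_effects`, and the tolerance `tol` are assumptions made for illustration, and the scale adjustment mentioned in the abstract is omitted.

```python
import numpy as np

def decompose_effects(B, tol=1e-8):
    """Split cluster-specific coefficients into common and deviation terms.

    B is a (p, K) array: row j holds predictor j's coefficient in each of
    the K clusters. Assumption (not taken from the abstract): deviations
    are identified by forcing them to sum to zero across clusters.
    """
    B = np.asarray(B, dtype=float)
    common = B.mean(axis=1)                 # common effect of each predictor
    deviations = B - common[:, None]        # cluster-specific terms, rows sum to 0
    # A predictor is a source of heterogeneity if any deviation is nonzero.
    heterogeneous = np.any(np.abs(deviations) > tol, axis=1)
    # A predictor is relevant if it has a nonzero effect in any cluster.
    relevant = np.any(np.abs(B) > tol, axis=1)
    return common, deviations, heterogeneous, relevant

# Hypothetical coefficients for 3 predictors and 3 clusters:
# predictor 0 has a common effect only, predictor 1 differs by cluster,
# predictor 2 is irrelevant.
B = np.array([
    [1.0, 1.0, 1.0],
    [2.0, 0.0, -2.0],
    [0.0, 0.0, 0.0],
])
common, deviations, heterogeneous, relevant = decompose_effects(B)
```

In this toy example, only the second predictor would be flagged as a source of heterogeneity, while the first is relevant but has the same effect in every cluster. A sparse penalty on the deviation terms, as the abstract describes, is what would drive such rows to exactly zero during estimation.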