Paper Title

Factor analysis in high dimensional biological data with dependent observations

Author

McKennan, Chris

Abstract

Factor analysis is a critical component of high dimensional biological data analysis. However, modern biological data contain two key features that irrevocably corrupt existing methods. First, these data, which include longitudinal, multi-treatment and multi-tissue data, contain samples that break critical independence requirements necessary for the utilization of prevailing methods. Second, biological data contain factors with large, moderate and small signal strengths, and therefore violate the ubiquitous "pervasive factor" assumption essential to the performance of many methods. In this work, I develop a novel statistical framework to perform factor analysis and interpret its results in data with dependent observations and factors whose signal strengths span several orders of magnitude. I then prove that my methodology can be used to solve many important and previously unsolved problems that routinely arise when analyzing dependent biological data, including high dimensional covariance estimation, subspace recovery, latent factor interpretation and data denoising. Additionally, I show that my estimator for the number of factors overcomes both the notorious "eigenvalue shadowing" problem, as well as the biases due to the pervasive factor assumption that plague existing estimators. Simulated and real data demonstrate the superior performance of my methodology in practice.
