Title
Variable selection in high-dimensional logistic regression models using a whitening approach
Authors
Abstract
In bioinformatics, the rapid development of sequencing technology has enabled us to collect an increasing amount of omics data. Classification based on omics data is one of the central problems in biomedical research. However, omics data usually have a limited sample size but a high feature dimension, and it is assumed that only a few features (biomarkers) are active, i.e. informative for discriminating between different categories (for example, cancer subtypes or responders/non-responders to a treatment). Identifying active biomarkers for classification has therefore become fundamental to omics data analysis. Focusing on binary classification, we propose an innovative feature selection method that aims to deal with the high correlations between biomarkers. Various studies have shown the notorious influence of correlated biomarkers and the difficulty of accurately identifying the active ones. Our method, WLogit, consists in whitening the design matrix to remove the correlations between biomarkers, then using a penalized criterion adapted to the logistic regression model to select features. The performance of WLogit is assessed on synthetic data in several scenarios and compared with other approaches. The results suggest that WLogit can identify almost all active biomarkers even when the biomarkers are highly correlated, a setting in which the other methods fail, and that this leads to higher classification accuracy. The performance is also evaluated on the classification of two lymphoma subtypes, where the obtained classifier also outperforms the other methods. Our method is implemented in the WLogit R package, available from the Comprehensive R Archive Network (CRAN).
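The two steps named in the abstract (whiten the design matrix, then fit a penalized logistic regression) can be illustrated with a minimal NumPy sketch. This is not the WLogit package's algorithm or API. It makes several assumptions that are not in the abstract: the covariance is estimated by a ridge-regularized sample covariance (`eps`), the penalty is a plain L1 penalty solved by proximal gradient descent, coefficients are mapped back to the original scale by multiplying with the whitening matrix, and features are picked by a simple top-k rule. The function names `whiten` and `l1_logistic`, the simulated AR(1) data, and all parameter values are hypothetical.

```python
import numpy as np


def whiten(X, eps=1e-3):
    """Center X and multiply it by an inverse square root of its covariance.

    Assumption: the covariance is the sample covariance plus eps * I,
    which keeps the matrix invertible when features outnumber samples.
    """
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / X.shape[0] + eps * np.eye(X.shape[1])
    vals, vecs = np.linalg.eigh(sigma)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T  # symmetric inverse square root
    return Xc @ W, W


def l1_logistic(X, y, lam, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent (ISTA)."""
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    # Step size 1/L, with L an upper bound on the Lipschitz constant
    # of the gradient of the average logistic loss (intercept included).
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(b0 + X @ beta)))
        resid = prob - y
        b0 -= step * resid.mean()
        z = beta - step * (X.T @ resid) / n
        # Soft-thresholding: the proximal operator of the L1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b0, beta


# Simulated example: strongly correlated features, three active ones.
rng = np.random.default_rng(0)
n, p, rho = 400, 20, 0.9
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1) correlation
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta_true = np.zeros(p)
beta_true[[0, 7, 14]] = 2.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

X_tilde, W = whiten(X)
b0_hat, beta_tilde = l1_logistic(X_tilde, y, lam=0.02)
beta_hat = W @ beta_tilde  # back to the scale of the original features
selected = np.sort(np.argsort(-np.abs(beta_hat))[:3])
print("largest coefficients at indices:", selected)
```

Because the whitened design is `Xc @ W`, the linear predictor `Xc @ beta` equals `X_tilde @ (inv(W) @ beta)`, so the coefficients fitted on the whitened data are mapped back with `W @ beta_tilde`. The top-3 rule at the end is only a stand-in for whatever selection step one prefers.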