Paper title

Benchmarking missing-values approaches for predictive models on health databases

Authors

Alexandre Perez-Lebel, Gaël Varoquaux, Marine Le Morvan, Julie Josse, Jean-Baptiste Poline

Abstract

BACKGROUND: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative (rather than generative) modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics.

RESULTS: Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain imaging one, a health survey and two intensive care ones. Using gradient-boosted trees, we compare native support for missing values with simple and state-of-the-art imputation prior to learning. We investigate prediction accuracy and computational time. For prediction after imputation, we find that adding an indicator to express which values have been imputed is important, suggesting that the data are missing not at random. Elaborate missing-values imputation can improve prediction compared to simple strategies but requires longer computational time on large data. Learning trees that model missing values (with missing incorporated attribute) leads to robust, fast, and well-performing predictive modeling.

CONCLUSIONS: Native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost. When using imputation, it is important to add indicator columns expressing which values have been imputed.
