Paper Title
The Missing Indicator Method: From Low to High Dimensions
Paper Authors
Paper Abstract
Missing data is common in applied data science, particularly in tabular data sets from healthcare, the social sciences, and the natural sciences. Most supervised learning methods only work on complete data, so incomplete data sets require preprocessing such as missing value imputation. However, imputation alone does not encode useful information about the missing values themselves. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables that mark the missing pattern, can be used in conjunction with imputation to improve model performance. While commonly used in data science, MIM is surprisingly understudied from an empirical and especially a theoretical perspective. In this paper, we show empirically and theoretically that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Additionally, we find that for high-dimensional data sets with many uninformative indicators, MIM can induce model overfitting and thus hurt test performance. To address this issue, we introduce Selective MIM (SMIM), a novel MIM extension that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM in general, and improves on MIM for high-dimensional data. Lastly, to demonstrate the utility of MIM on real-world data science tasks, we show the effectiveness of MIM and SMIM on clinical tasks generated from the MIMIC-III database of electronic health records.
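To make the idea of combining imputation with missing indicators concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes mean imputation and uses `float("nan")` to mark missing entries. The function name `mean_impute_with_indicators` and the `indicator_columns` parameter are hypothetical names introduced here for illustration. Passing an explicit column list mimics the structure of a selective variant, but how SMIM decides which features have informative missing patterns is not described in the abstract and is not modeled here.

```python
import math


def mean_impute_with_indicators(rows, indicator_columns=None):
    """Mean-impute a table and append binary missing indicators.

    rows: list of equal-length lists of floats, with NaN marking missing values.
    indicator_columns: hypothetical parameter. None adds an indicator for every
        column that has at least one missing value (plain MIM); a list of
        column indices adds indicators only for those columns.
    """
    n_cols = len(rows[0])
    missing = [[math.isnan(v) for v in row] for row in rows]

    # Column means over observed entries only (assumption: mean imputation).
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed) if observed else 0.0)

    if indicator_columns is None:
        indicator_columns = [j for j in range(n_cols) if any(m[j] for m in missing)]
    else:
        indicator_columns = list(indicator_columns)

    out = []
    for row, miss in zip(rows, missing):
        imputed = [means[j] if miss[j] else row[j] for j in range(n_cols)]
        flags = [1.0 if miss[j] else 0.0 for j in indicator_columns]
        out.append(imputed + flags)
    return out


nan = float("nan")
X = [[1.0, nan, 3.0], [2.0, 4.0, nan], [3.0, 6.0, 5.0]]

# Plain MIM: indicators for columns 1 and 2, which both contain missing values.
X_mim = mean_impute_with_indicators(X)

# Selective variant: indicator only for column 1 (the choice is hand-picked here).
X_selective = mean_impute_with_indicators(X, indicator_columns=[1])
```

Each output row holds the imputed features followed by the indicator columns, so a downstream model can learn from the missingness pattern as well as from the imputed values. Restricting `indicator_columns` keeps the feature count lower, which is the concern the abstract raises for high-dimensional data with many uninformative indicators.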