Mitigating Health Data Poverty: Generative Approaches versus Resampling for Time-series Clinical Data

Paper Title

Mitigating Health Data Poverty: Generative Approaches versus Resampling for Time-series Clinical Data

Authors

Marchesi, Raffaele; Micheletti, Nicolo; Jurman, Giuseppe; Osmani, Venet

Abstract

Several approaches have been developed to mitigate algorithmic bias stemming from health data poverty, where minority groups are underrepresented in training datasets. Augmenting the minority class using resampling (such as SMOTE) is a widely used approach due to the simplicity of the algorithms. However, these algorithms decrease data variability and may introduce correlations between samples, giving rise to the use of generative approaches based on GAN. Generating high-dimensional, time-series, authentic data that provides wide distribution coverage of the real data remains a challenging task for both resampling and GAN-based approaches. In this work we propose the CA-GAN architecture, which addresses some of the shortcomings of current approaches, and we provide a detailed comparison with both SMOTE and WGAN-GP*, using a high-dimensional, time-series, real dataset of 3343 hypotensive Caucasian and Black patients. We show that our approach is better at both generating authentic data of the minority class and remaining within the original distribution of the real data.
