Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning

Paper title

Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning

Authors

Justin Engelmann, Stefan Lessmann

Abstract

Class imbalance is a common problem in supervised learning and impedes the predictive performance of classification models. Popular countermeasures include oversampling the minority class. Standard methods like SMOTE rely on finding nearest neighbours and linear interpolations, which are problematic in the case of high-dimensional, complex data distributions. Generative Adversarial Networks (GANs) have been proposed as an alternative method for generating artificial minority examples as they can model complex distributions. However, prior research on GAN-based oversampling does not incorporate recent advancements from the literature on generating realistic tabular data with GANs. Previous studies also focus on numerical variables, whereas categorical features are common in many business applications of classification methods such as credit scoring. The paper proposes an oversampling method based on a conditional Wasserstein GAN that can effectively model tabular datasets with numerical and categorical variables and pays special attention to the downstream classification task through an auxiliary classifier loss. We benchmark our method against standard oversampling methods and the imbalanced baseline on seven real-world datasets. Empirical results evidence the competitiveness of GAN-based oversampling.
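
To make concrete the kind of standard oversampling the abstract refers to, here is a minimal pure-Python sketch of SMOTE-style interpolation: pick a minority point, pick one of its nearest minority neighbours, and sample a new point on the line segment between them. It is an illustration only, not the paper's method and not a reference SMOTE implementation. The function name `smote_like_oversample`, its parameters, and the toy data are assumptions made for this example. The sketch handles numerical features only, since interpolating between category labels is not defined.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority points
    and their k nearest minority neighbours (numerical features only).

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Draw a random position on the segment between base and neighbour.
        lam = rng.random()
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square (assumed data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=5)
```

Every synthetic point lies on a straight line between two existing minority examples, so the method can only fill in the space between observed points. A generative model such as the conditional Wasserstein GAN described in the abstract instead learns to sample from an estimate of the minority distribution.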
