Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

Paper Title

Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

Authors

Qian, Min; Li, Yan-Fu

Abstract

With the abundance of industrial datasets, imbalanced classification has become a common problem in several application domains. Oversampling is an effective way to address imbalanced classification. One of the main challenges for existing oversampling methods is labeling the new synthetic samples accurately. Inaccurate labels on synthetic samples distort the distribution of the dataset and can worsen classification performance. This paper introduces the idea of weakly supervised learning to handle the inaccurate labeling of synthetic samples caused by traditional oversampling methods. Graph semi-supervised SMOTE is developed to improve the credibility of the synthetic samples' labels. In addition, we propose a cost-sensitive neighborhood components analysis for high-dimensional datasets and a bootstrap-based ensemble framework for highly imbalanced datasets. The proposed method achieves good classification performance on 8 synthetic datasets and 3 real-world datasets, especially for high-imbalance and high-dimensionality problems. Its average performance and robustness are better than those of the benchmark methods.
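The abstract's core concern, that interpolated synthetic samples may not deserve the minority label, can be illustrated with a small sketch. This is not the paper's graph semi-supervised SMOTE: the function names (`smote_like`, `minority_confidence`), the toy data, and the nearest-neighbour vote used as a stand-in for graph-based label inference are all assumptions made for illustration.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbour = rng.choice(k_nearest(base, others, k))
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def minority_confidence(point, labelled, k=3):
    """Fraction of the k nearest labelled originals that are minority (label 1).
    Hypothetical stand-in for graph-based label inference, not the paper's method."""
    nearest = sorted(labelled, key=lambda xy: math.dist(point, xy[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Toy 2-D data (assumed for illustration): label 1 = minority, 0 = majority.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
majority = [(3.0, 3.0), (3.2, 2.8), (2.9, 3.1), (3.1, 3.3), (2.0, 2.0), (2.8, 2.9)]
labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]

synthetic = smote_like(minority, n_new=5)
# Keep only synthetic points whose neighbourhood supports the minority label.
trusted = [s for s in synthetic if minority_confidence(s, labelled) >= 0.5]
```

In this toy setting the classes are well separated, so every synthetic point is retained. With overlapping classes, the confidence filter would discard or down-weight points that land in majority territory, which is the failure mode the abstract attributes to traditional oversampling.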
