Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods

Paper Title

Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods

Authors

Kerwin, Kathleen; Bastian, Nathaniel D.

Abstract

This study uses stacked generalization, a two-step process for combining machine learning methods in which the combining model is called a meta learner or super learner. In step one, the performance of the individual algorithms is improved by minimizing each algorithm's error rate to reduce its bias on the learning set. In step two, their results are fed into the meta learner, whose stacked, blended output demonstrates improved performance, with the weakest algorithms learning better. The method is essentially an enhanced cross-validation strategy. Although the process is computationally expensive, the resulting performance metrics on resampled fraud data show that the increased system cost can be justified. A fundamental characteristic of fraud data is that it is inherently not systematic, and the optimal resampling methodology has not yet been identified. Building a test harness that accounts for all permutations of algorithm and sample-set pairs ensures that the complex, intrinsic data structures are all thoroughly tested. A comparative analysis of fraud data that applies stacked generalization provides useful insight toward finding the optimal mathematical formula for imbalanced fraud data sets.
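To make the idea of a test harness over algorithm and sample-set pairs more concrete, here is a minimal Python sketch. It is not the authors' implementation. The abstract does not name the resampling methods or algorithms that were used, so the resamplers (`no_resampling`, `random_oversample`, `random_undersample`) and the algorithm names in `ALGORITHMS` are hypothetical placeholders chosen only for illustration. The sketch uses the standard library only and stops at enumerating the pairs; model training and the stacking step are left out.

```python
import itertools
import random
from collections import Counter


def no_resampling(rows, labels, seed=0):
    """Baseline: return the data unchanged."""
    return list(rows), list(labels)


def random_oversample(rows, labels, seed=0):
    """Duplicate rows of smaller classes at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows of larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical choices; the abstract does not list the actual methods.
RESAMPLERS = {
    "none": no_resampling,
    "oversample": random_oversample,
    "undersample": random_undersample,
}
ALGORITHMS = ["logistic_regression", "decision_tree", "naive_bayes"]


def build_grid(algorithms, resamplers):
    """Enumerate every (algorithm, resampler) pair the harness would evaluate."""
    return list(itertools.product(algorithms, resamplers))


if __name__ == "__main__":
    # Toy imbalanced data: 2 fraud rows (label 1) and 8 legitimate rows (label 0).
    rows = [[i] for i in range(10)]
    labels = [1, 1] + [0] * 8
    for algo, name in build_grid(ALGORITHMS, RESAMPLERS):
        _, y = RESAMPLERS[name](rows, labels)
        print(algo, name, dict(Counter(y)))
```

In a fuller harness, each pair would train the named base learner on the resampled training split, and the base learners' out-of-fold predictions would become the inputs to the meta learner. Resampling would normally be applied to training folds only, so that evaluation data keeps the original class imbalance.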
