Paper Title
The Impact of Discretization Method on the Detection of Six Types of Anomalies in Datasets
Paper Authors
Paper Abstract
Anomaly detection is the process of identifying cases, or groups of cases, that are in some way unusual and do not fit the general patterns present in the dataset. Numerous algorithms use discretization of numerical data in their detection processes. This study investigates the effect of the discretization method on the unsupervised detection of each of the six anomaly types acknowledged in a recent typology of data anomalies. To this end, experiments are conducted with various datasets and SECODA, a general-purpose algorithm for unsupervised non-parametric anomaly detection in datasets with numerical and categorical attributes. This algorithm employs discretization of continuous attributes, exponentially increasing weights and discretization cut points, and a pruning heuristic to detect anomalies with an optimal number of iterations. The results demonstrate that standard SECODA can detect all six types, but that different discretization methods favor the discovery of certain anomaly types. The main findings also hold for other detection techniques using discretization.
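To make the abstract's central claim concrete, the following minimal sketch shows how the choice of discretization method can change which cases look rare. It is not SECODA and does not reproduce the paper's experiments: the toy data, the helper names (`equal_width_edges`, `equal_frequency_edges`, `bin_counts`), and the use of raw bin counts as a rarity signal are all assumptions made purely for illustration.

```python
from bisect import bisect_right
from collections import Counter


def equal_width_edges(values, bins):
    """Interior cut points that split [min, max] into intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def equal_frequency_edges(values, bins):
    """Interior cut points chosen so each bin holds roughly the same number of cases."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def bin_counts(values, edges):
    """For each case, the number of cases sharing its bin (low count = rare)."""
    labels = [bisect_right(edges, v) for v in values]
    counts = Counter(labels)
    return [counts[label] for label in labels]


# Hypothetical toy data: eight ordinary cases and one extreme value (9.0).
data = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 9.0]

width_counts = bin_counts(data, equal_width_edges(data, 3))
freq_counts = bin_counts(data, equal_frequency_edges(data, 3))

print(width_counts)  # [8, 8, 8, 8, 8, 8, 8, 8, 1]
print(freq_counts)   # [3, 3, 3, 3, 3, 3, 3, 3, 3]
```

With equal-width bins the extreme value sits alone in its bin and stands out, whereas equal-frequency bins absorb it into a bin of ordinary size. This only illustrates why the discretization method matters for one simple kind of anomaly; it says nothing about how the six anomaly types studied in the paper behave.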