论文标题
概念漂移检测:通过模糊距离估计处理失踪值
Concept Drift Detection: Dealing with MissingValues via Fuzzy Distance Estimations
论文作者
论文摘要
在数据流中,不同时间点到达观测值的数据分布可能会改变 - 一种称为概念漂移的现象。虽然检测概念漂移是一个相对成熟的研究领域,但仅分离出具有缺失值的观察结果引入的不确定性解决方案。没有人尚未探讨这些解决方案是否可能影响漂移检测性能。但是,我们认为数据插补方法实际上可能会增加数据的不确定性,而不是减少数据。我们还猜测,插补可以将偏见引入估计漂移检测过程中分布变化的过程中,这可能会使训练学习模型更加困难。我们的想法是专注于估计观测值之间的距离,而不是估计缺失值,并根据估计误差定义将观测分配给直方图箱的成员资格函数。我们的解决方案包括一种新颖的掩盖距离学习(MDL)算法,以减少迭代估算观察值中每个缺失值以及模糊加权频率(FWF)方法识别数据分布中差异的方法。本文提出的概念漂移检测算法是一种单数且统一的算法,可以处理缺失值,但不能与概念漂移检测算法相结合的插入算法。合成和现实世界数据集的实验证明了这种方法的优势,并显示了其在检测缺少值的数据漂移方面的鲁棒性。这些发现表明,缺失值对概念漂移检测产生了深远的影响,但是使用模糊集理论来模型观察可以产生比插补更可靠的结果。
In data streams, the data distribution of arriving observations at different time points may change - a phenomenon called concept drift. While detecting concept drift is a relatively mature area of study, solutions to the uncertainty introduced by observations with missing values have only been studied in isolation. No one has yet explored whether or how these solutions might impact drift detection performance. We, however, believe that data imputation methods may actually increase uncertainty in the data rather than reducing it. We also conjecture that imputation can introduce bias into the process of estimating distribution changes during drift detection, which can make it more difficult to train a learning model. Our idea is to focus on estimating the distance between observations rather than estimating the missing values, and to define membership functions that allocate observations to histogram bins according to the estimation errors. Our solution comprises a novel masked distance learning (MDL) algorithm to reduce the cumulative errors caused by iteratively estimating each missing value in an observation and a fuzzy-weighted frequency (FWF) method for identifying discrepancies in the data distribution. The concept drift detection algorithm proposed in this paper is a singular and unified algorithm that can handle missing values, but not an imputation algorithm combined with a concept drift detection algorithm. Experiments on both synthetic and real-world data sets demonstrate the advantages of this method and show its robustness in detecting drift in data with missing values. These findings reveal that missing values exert a profound impact on concept drift detection, but using fuzzy set theory to model observations can produce more reliable results than imputation.