Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

Paper title

Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

Authors

Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, Heikki Mannila

Abstract

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
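The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 0-1 data matrix, a local swap that preserves row and column sums, a single user-supplied statistic (here a hypothetical co-occurrence count for one column pair), and an assumed `weight` parameter that penalizes deviation from the original statistic. All function and parameter names are illustrative.

```python
import math
import random


def cooccurrence(matrix, col_a, col_b):
    """Count rows in which both columns contain a 1."""
    return sum(1 for row in matrix if row[col_a] and row[col_b])


def metropolis_swap_sample(original, statistic, steps=5000, weight=2.0, seed=0):
    """Sample a 0-1 matrix with the same row and column sums as `original`
    whose `statistic` value tends to stay close to the original's.

    `weight` is an assumed tuning parameter: larger values penalize
    deviations from the original statistic more strongly.
    """
    rng = random.Random(seed)
    current = [row[:] for row in original]
    n_rows, n_cols = len(current), len(current[0])
    target = statistic(original)
    energy = abs(statistic(current) - target)

    for _ in range(steps):
        i, j = rng.randrange(n_rows), rng.randrange(n_rows)
        a, b = rng.randrange(n_cols), rng.randrange(n_cols)
        # A local swap needs the 2x2 submatrix to be [[1, 0], [0, 1]].
        if not (current[i][a] == 1 and current[j][b] == 1
                and current[i][b] == 0 and current[j][a] == 0):
            continue
        # Propose the swap: the submatrix becomes [[0, 1], [1, 0]].
        current[i][a], current[i][b] = 0, 1
        current[j][a], current[j][b] = 1, 0
        new_energy = abs(statistic(current) - target)
        # Metropolis rule: always accept a move that does not increase the
        # deviation; accept a worse move with an exponentially small chance.
        accept = (new_energy <= energy
                  or rng.random() < math.exp(-weight * (new_energy - energy)))
        if accept:
            energy = new_energy
        else:
            # Reject: undo the swap.
            current[i][a], current[i][b] = 1, 0
            current[j][a], current[j][b] = 0, 1
    return current


data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
sample = metropolis_swap_sample(data, lambda m: cooccurrence(m, 0, 1))
```

Because every accepted move is a swap inside a 2x2 submatrix, the sample keeps the row and column sums of `data` exactly, while the chosen statistic is only encouraged, not forced, to stay near its original value.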
