Paper Title
Unsupervised Discretization by Two-dimensional MDL-based Histogram
Paper Authors
Paper Abstract
Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalized maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which Partitions each dimension ALternately and then Merges neighboring regions, all using the MDL principle. Experiments on synthetic data show that PALM 1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; 2) approximates well a wide range of partitions outside the model class; 3) converges, in contrast to the state-of-the-art multivariate discretization method IPD. Finally, we apply our algorithm to three spatial datasets, and we demonstrate that, compared to kernel density estimation (KDE), our algorithm not only reveals more detailed density changes, but also fits unseen data better, as measured by the log-likelihood.
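To make the MDL-based model selection idea concrete, the sketch below picks the number of bins for a one-dimensional equal-width histogram by minimizing a code length. This is a simplified illustration, not the paper's method: the paper uses the normalized maximum likelihood (refined MDL) and locally adaptive cut points, whereas this sketch uses a crude two-part code (negative log-likelihood plus a BIC-style parameter penalty) over equal-width binnings. The function names `mdl_histogram_score` and `best_histogram` are illustrative choices, not from the paper.

```python
import math

def mdl_histogram_score(data, n_bins, lo, hi):
    """Two-part MDL score (in nats) for an equal-width histogram.

    Simplified stand-in for the paper's NML-based score: the first part
    is the negative log-likelihood of the data under the histogram
    density; the second part charges 0.5 * log(n) per free parameter
    (the n_bins - 1 free bin probabilities).
    """
    n = len(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp x == hi into the last bin.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    # Negative log-likelihood under the maximum-likelihood histogram
    # density, which assigns density c / (n * width) inside a bin
    # holding c points.
    nll = 0.0
    for c in counts:
        if c > 0:
            nll -= c * math.log(c / (n * width))
    penalty = 0.5 * (n_bins - 1) * math.log(n)
    return nll + penalty

def best_histogram(data, max_bins=50):
    """Return the bin count (1..max_bins) with the lowest MDL score."""
    lo, hi = min(data), max(data)
    return min(range(1, max_bins + 1),
               key=lambda k: mdl_histogram_score(data, k, lo, hi))
```

On clearly multi-modal data the likelihood gain from finer bins outweighs the parameter penalty, so more than one bin is selected; on the two-dimensional partitions the paper considers, the same trade-off is scored with NML over far more flexible, non-rectangular regions.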