Benchmarking distance-based partitioning methods for mixed-type data

Title

Benchmarking distance-based partitioning methods for mixed-type data

Authors

Costa, Efthymios; Papatsouma, Ioanna; Markos, Angelos

Abstract

Clustering mixed-type data, that is, observation-by-variable data that consist of both continuous and categorical variables, poses novel challenges. Foremost among these challenges is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing eight distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations, carried out according to a full factorial design, examines the effect of a variety of factors on cluster recovery. The amount of cluster overlap, the percentage of categorical variables in the data set, the number of clusters, and the number of observations had the largest effects on cluster recovery. In most of the tested scenarios, KAMILA, K-Prototypes, and sequential Factor Analysis and K-Means clustering performed better than the other methods. The study can be a useful reference for practitioners choosing the most appropriate method.
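
To make the phrase "full factorial design" concrete, the sketch below enumerates every combination of the four factors named in the abstract. The factor levels and the helper name `full_factorial` are assumptions made purely for illustration; the abstract does not state the levels used in the study, and this is not the authors' simulation code.

```python
from itertools import product

# Hypothetical factor levels, chosen only for illustration; the abstract
# names these factors but does not list the levels used in the study.
factors = {
    "overlap": ["low", "medium", "high"],
    "pct_categorical": [0.2, 0.5, 0.8],
    "n_clusters": [3, 5],
    "n_observations": [100, 600],
}


def full_factorial(factors):
    """Return one dict per combination of factor levels (hypothetical helper)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


design = full_factorial(factors)
print(len(design))  # 3 * 3 * 2 * 2 = 36 scenarios
```

In a benchmarking study of this kind, each scenario in `design` would typically be used to generate several simulated data sets, on which every clustering method is run and scored for cluster recovery.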
