Paper Title
Class Density and Dataset Quality in High-Dimensional, Unstructured Data
Authors
Abstract
We provide a definition of class density that can be used to measure the aggregate similarity of the samples within each class of a high-dimensional, unstructured dataset. We then put forth several candidate methods for calculating class density and analyze the correlation between the values each method produces and the corresponding per-class test accuracies achieved by a trained model. Additionally, we propose a definition of dataset quality for high-dimensional, unstructured data and show that datasets that met a certain quality threshold (experimentally demonstrated to be > 10 for the datasets studied) were candidates for eliding redundant data based on the individual class densities.
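The abstract does not state which density measures the paper evaluates, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes each sample has already been mapped to a fixed-length feature vector, and it uses mean pairwise cosine similarity within a class as one possible stand-in for "aggregate similarity". The function names `class_density` and `density_accuracy_correlation` are hypothetical, and the paper's dataset quality measure is not attempted here.

```python
import numpy as np


def class_density(features, labels):
    """Mean pairwise cosine similarity of the samples in each class.

    Assumption: cosine similarity is an illustrative choice of measure,
    not necessarily one of the candidate methods used in the paper.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Normalize rows to unit length so dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    densities = {}
    for c in np.unique(labels):
        x = unit[labels == c]
        n = len(x)
        key = c.item()  # plain Python scalar as dictionary key
        if n < 2:
            densities[key] = float("nan")  # undefined for a single sample
            continue
        sim = x @ x.T
        # Average over ordered pairs of distinct samples (drop the diagonal).
        densities[key] = float((sim.sum() - np.trace(sim)) / (n * (n - 1)))
    return densities


def density_accuracy_correlation(densities, class_accuracy):
    """Pearson correlation between per-class density and per-class accuracy."""
    classes = sorted(densities)
    d = np.array([densities[c] for c in classes])
    a = np.array([class_accuracy[c] for c in classes])
    return float(np.corrcoef(d, a)[0, 1])
```

With a density value and a test accuracy for each class, the second function returns a single coefficient summarizing how closely that density measure tracks per-class accuracy, which is the kind of comparison the abstract describes across candidate methods.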