Paper title
Balancing Geometry and Density: Path Distances on High-Dimensional Data
Paper authors
Paper abstract
New geometric and computational analyses of power-weighted shortest-path distances (PWSPDs) are presented. By illuminating the way these metrics balance density and geometry in the underlying data, we clarify their key parameters and discuss how they may be chosen in practice. Comparisons are made with related data-driven metrics, which illustrate the broader role of density in kernel-based unsupervised and semi-supervised machine learning. Computationally, we relate PWSPDs on complete weighted graphs to their analogues on weighted nearest neighbor graphs, providing high probability guarantees on their equivalence that are near-optimal. Connections with percolation theory are developed to establish estimates on the bias and variance of PWSPDs in the finite sample setting. The theoretical results are bolstered by illustrative experiments, demonstrating the versatility of PWSPDs for a wide range of data settings. Throughout the paper, our results require only that the underlying data is sampled from a low-dimensional manifold, and depend crucially on the intrinsic dimension of this manifold, rather than its ambient dimension.
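As a rough illustration of the objects discussed in the abstract, the sketch below computes power-weighted shortest-path distances from one sample point to all others. The abstract does not state a formula, so the definition used here is an assumption: each edge is weighted by the Euclidean distance raised to a power p, the path cost is the sum of edge weights, and the distance is the minimum path cost raised to 1/p. The function name `pwspd` and its parameters `p` and `k` are hypothetical. Passing `k` restricts the graph to a symmetrized k-nearest-neighbor graph, which mirrors the complete-graph versus nearest-neighbor-graph comparison only in spirit.

```python
import heapq
import math


def pwspd(points, source, p=2.0, k=None):
    """Power-weighted shortest-path distances from points[source] to every point.

    Assumed definition (not given in the abstract): edge weight ||x_i - x_j||^p,
    path cost = sum of edge weights, distance = (minimum path cost)^(1/p).
    If k is None the complete graph is used; otherwise a symmetrized k-NN graph.
    """
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

    # Build adjacency lists: complete graph, or symmetrized k-nearest-neighbor graph.
    if k is None:
        nbrs = [[j for j in range(n) if j != i] for i in range(n)]
    else:
        sets = [set() for _ in range(n)]
        for i in range(n):
            others = (j for j in range(n) if j != i)
            nearest = sorted(others, key=lambda j: dist[i][j])[:k]
            for j in nearest:
                sets[i].add(j)
                sets[j].add(i)
        nbrs = [sorted(s) for s in sets]

    # Dijkstra's algorithm on the power-weighted edges.
    cost = [math.inf] * n
    cost[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        c, u = heapq.heappop(heap)
        if c > cost[u]:
            continue  # stale heap entry
        for v in nbrs[u]:
            new_c = c + dist[u][v] ** p
            if new_c < cost[v]:
                cost[v] = new_c
                heapq.heappush(heap, (new_c, v))

    # Unreachable points keep an infinite distance.
    return [c ** (1.0 / p) for c in cost]


# Four evenly spaced points on a line.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
d_p1 = pwspd(line, source=0, p=1.0)        # p = 1 recovers Euclidean distance
d_p2 = pwspd(line, source=0, p=2.0)        # p = 2 favors many short hops
d_knn = pwspd(line, source=0, p=2.0, k=1)  # same value on a sparse k-NN graph
```

With p = 1 the distance between the two endpoints is the Euclidean distance, while with p = 2 the path through the intermediate points is cheaper than the direct edge. This is one way to see how larger p lets sample density shape the metric. In this toy example the 1-nearest-neighbor graph gives the same endpoint distance as the complete graph.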