Paper Title

Approximating Persistent Homology for Large Datasets

Authors

Yueqi Cao, Anthea Monod

Abstract

Persistent homology is an important methodology from topological data analysis that adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to implement when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying multiple smaller subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples (taken as a mean persistence measure computed from the subsamples) is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and the size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
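
The subsampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: it handles only degree-0 persistence of a Vietoris-Rips filtration (where finite death times equal minimum-spanning-tree edge lengths), it draws subsamples without replacement, and it represents the mean persistence measure as a dictionary of diagram points with mass 1/B each, where B is the number of subsamples. The names `h0_diagram` and `mean_persistence_measure` are hypothetical, chosen for this example only.

```python
import math
import random
from collections import defaultdict


def h0_diagram(points):
    """Finite degree-0 Vietoris-Rips persistence pairs of a point cloud.

    Every connected component is born at scale 0 and dies at the length of
    the minimum-spanning-tree edge that merges it into another component
    (Kruskal's algorithm with union-find). The one infinite bar is omitted.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    diagram = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            diagram.append((0.0, length))
    return diagram


def mean_persistence_measure(data, n_subsamples, subsample_size, seed=0):
    """Average the diagrams of random subsamples as a weighted point measure.

    Returns a dict mapping (birth, death) -> mass. Each diagram point of each
    subsample contributes mass 1 / n_subsamples.
    Assumption: subsamples are drawn without replacement.
    """
    rng = random.Random(seed)
    measure = defaultdict(float)
    for _ in range(n_subsamples):
        subsample = rng.sample(data, subsample_size)
        for pair in h0_diagram(subsample):
            measure[pair] += 1.0 / n_subsamples
    return dict(measure)


# Usage: a large sample from the unit circle, summarized via small subsamples.
rng = random.Random(1)
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(2000)]
circle = [(math.cos(t), math.sin(t)) for t in angles]

measure = mean_persistence_measure(circle, n_subsamples=20, subsample_size=50)
total_mass = sum(measure.values())  # each subsample contributes 49 finite pairs
```

Only 50 points are ever processed at once, so the quadratic cost of building the filtration applies to the subsample size rather than to the full dataset. For higher homology degrees, a dedicated library such as GUDHI or Ripser would replace `h0_diagram`.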
