Approximating Persistent Homology for Large Datasets
arXiv:2204.09155v3 Announce Type: replace
Abstract: Persistent homology is an important methodology in topological data analysis which adapts theory from algebraic topology to data settings. Computing persistent homology produces persistence diagrams, which have been successfully used in diverse domains. Despite its widespread use, persistent homology becomes computationally infeasible when a dataset is very large. We study a statistical approach to the problem of computing persistent homology for massive datasets using a multiple subsampling framework and extend it to three summaries of persistent homology: Hölder continuous vectorizations of persistence diagrams; the alternative representation as persistence measures; and standard persistence diagrams. Specifically, we derive finite sample convergence rates for empirical means for persistent homology and provide practical guidance on interpreting and tuning parameters. We validate our approach through extensive experiments on both synthetic and real-world data. We demonstrate the performance of multiple subsampling in a permutation test to analyze the topological structure of Poincaré embeddings of large lexical databases.
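The multiple subsampling idea described in the abstract can be illustrated with a minimal sketch: draw several small subsamples, compute a persistence summary on each, and average the results. This is not the authors' implementation. It is restricted to degree-0 homology, where the finite death times of the Vietoris-Rips filtration coincide with the edge lengths of a minimum spanning tree, so no external library is needed. The function names, the survival-curve vectorization, the grid, and the parameters `n_sub` and `n_rep` are all assumptions made for illustration; in particular, the survival curve is only a stand-in and is not claimed to be one of the Hölder continuous vectorizations studied in the paper.

```python
import math
import random


def h0_deaths(points):
    """Finite H0 death times of a Vietoris-Rips filtration.

    Uses Kruskal's algorithm: each edge that merges two components
    kills one H0 class. Deaths are reported as edge lengths (some
    conventions use half the edge length instead).
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)
    return deaths  # n - 1 values; the one essential class is omitted


def vectorize(deaths, grid):
    """Hypothetical vectorization: fraction of H0 classes alive at each scale."""
    return [sum(d > t for d in deaths) / len(deaths) for t in grid]


def mean_subsample_summary(points, n_sub, n_rep, grid, seed=0):
    """Average the vectorized summary over n_rep subsamples of size n_sub."""
    rng = random.Random(seed)
    total = [0.0] * len(grid)
    for _ in range(n_rep):
        sub = rng.sample(points, n_sub)  # subsample without replacement
        vec = vectorize(h0_deaths(sub), grid)
        total = [a + b for a, b in zip(total, vec)]
    return [a / n_rep for a in total]


# Usage: 2000 noiseless points on the unit circle, summarized from
# 20 subsamples of 50 points each instead of the full point cloud.
_rng = random.Random(1)
cloud = [
    (math.cos(t), math.sin(t))
    for t in (_rng.uniform(0.0, 2.0 * math.pi) for _ in range(2000))
]
grid = [0.05 * k for k in range(11)]
summary = mean_subsample_summary(cloud, n_sub=50, n_rep=20, grid=grid)
```

Each subsample costs on the order of `n_sub` squared distance evaluations, so the total work scales with `n_rep * n_sub**2` rather than with the square of the full dataset size. The trade-off between subsample size and number of repetitions is the kind of tuning question the abstract says the paper addresses.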