Space-Efficient Text Indexing with Mismatches using Function Inversion

arXiv:2604.01307v1 Announce Type: new
Abstract: A classic data structure problem is to preprocess a string T of length $n$ so that, given a query $q$, we can quickly find all substrings of T with Hamming distance at most $k$ from the query string. Variants of this problem have seen significant research both in theory and in practice. For a wide parameter range, the best worst-case bounds are achieved by the “CGL tree” (Cole, Gottlieb, Lewenstein 2004), which achieves query time roughly $tilde{O}(|q| + log^k n + # occ)$ where $# occ$ is the size of the output, and space ${O}(nlog^k n)$. The CGL Tree space was recently improved to $O(n log^{k-1} n)$ (Kociumaka, Radoszewski 2026).
A natural question is whether a high space bound is necessary. How efficient can we make queries when the data structure is constrained to $O(n)$ space? While this question has seen extensive research, all known results have query time with unfavorable dependence on $n$, $k$, and the alphabet $Sigma$. The state of the art query time (Chan et al. 2011) is roughly $tilde{O}(|q| + |Sigma|^k log^{k^2 + k} n + # occ)$.
We give an $O(n)$-space data structure with query time roughly $tilde{O}(|q| + log^{4k} n + log^{2k} n # occ)$, with no dependence on $|Sigma|$. Even if $|Sigma| = O(1)$, this is the best known query time for linear space if $kgeq 3$ unless $# occ$ is large. Our results give a smooth tradeoff between time and space. We also give the first sublinear-space results: we give a succinct data structure using only $o(n)$ space in addition to the text itself.
Our main technical idea is to apply function inversion (Fiat, Naor 2000) to the CGL tree. Combining these techniques is not immediate; in fact, we revisit the exposition of both the Fiat-Naor data structure and the CGL tree to obtain our bounds. Along the way, we obtain improved performance for both data structures, which may be of independent interest.

Liked Liked