Information-Geometric Decomposition of Generalization Error in Unsupervised Learning
arXiv:2604.12340v1 Announce Type: new
Abstract: We decompose the Kullback–Leibler generalization error (GE) — the expected KL divergence from the data distribution to the trained model — of unsupervised learning into three non-negative components: model error, data bias, and variance. The decomposition is exact for any e-flat model class and follows from two identities of information geometry: the generalized Pythagorean theorem and a dual e-mixture variance identity. As an analytically tractable demonstration, we apply the framework to $\epsilon$-PCA, a regularized principal component analysis in which the empirical covariance is truncated at rank $N_K$ and discarded directions are pinned at a fixed noise floor $\epsilon$. Although rank-constrained $\epsilon$-PCA is not itself e-flat, it admits a technical reformulation with the same total GE on isotropic Gaussian data, under which each component of the decomposition takes closed form. The optimal rank emerges as the cutoff $\lambda_{\mathrm{cut}}^{*} = \epsilon$ — the model retains exactly those empirical eigenvalues exceeding the noise floor — with the cutoff reflecting a marginal-rate balance between model-error gain and data-bias cost. A boundary comparison further yields a three-regime phase diagram — retain-all, interior, and collapse — separated by the lower Marchenko–Pastur edge and an analytically computable collapse threshold $\epsilon_{*}(\alpha)$, where $\alpha$ is the dimension-to-sample-size ratio. All claims are verified numerically.
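To make the $\epsilon$-PCA construction concrete, the following is a minimal numerical sketch (not the authors' code): it forms the empirical covariance of isotropic Gaussian data, applies the abstract's cutoff rule of retaining exactly the empirical eigenvalues above the noise floor $\epsilon$, and evaluates the KL generalization error from the true distribution to the fitted Gaussian model in closed form. The dimension, sample size, true variance, and $\epsilon$ are illustrative assumptions, and the collapse threshold $\epsilon_{*}(\alpha)$ is not reproduced here, so only the retain-all boundary at the lower Marchenko–Pastur edge is checked.

```python
# Sketch of the epsilon-PCA estimator described in the abstract (assumed parameter
# values; the collapse threshold eps_*(alpha) from the paper is not reproduced).
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                     # dimension and sample size (alpha = d/n = 0.25), assumed
sigma2, eps = 1.0, 0.7             # true isotropic variance and noise floor, assumed

X = rng.normal(scale=np.sqrt(sigma2), size=(n, d))   # isotropic Gaussian data
S = X.T @ X / n                                      # empirical covariance
lam = np.linalg.eigvalsh(S)[::-1]                    # empirical eigenvalues, descending

# Cutoff rule from the abstract: retain exactly the empirical eigenvalues above eps,
# pin the discarded directions at the noise floor eps.
k = int(np.sum(lam > eps))
model_eigs = np.concatenate([lam[:k], np.full(d - k, eps)])

# KL( N(0, sigma2 I) || N(0, Sigma_model) ): the model shares the empirical
# eigenbasis and the data are isotropic, so the divergence is a sum over eigenvalues.
ge = 0.5 * np.sum(sigma2 / model_eigs - 1.0 + np.log(model_eigs / sigma2))

alpha = d / n
mp_lower_edge = sigma2 * (1.0 - np.sqrt(alpha)) ** 2  # lower Marchenko-Pastur edge
regime = "retain-all" if eps < mp_lower_edge else "interior/collapse"
print(f"retained rank k = {k} of {d}, KL generalization error = {ge:.4f}, regime: {regime}")
```

With $\epsilon$ below the lower Marchenko–Pastur edge the printed rank equals $d$ (the retain-all regime); raising $\epsilon$ above that edge moves the estimator into the interior regime, where only eigenvalues above the floor survive.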