Two-sample comparison through additive tree models for density ratios
arXiv:2508.03059v3 Announce Type: replace-cross
Abstract: The ratio of two densities provides a direct characterization of their differences. We consider the two-sample comparison problem by estimating this ratio given i.i.d. observations from two distributions. To this end, we propose additive tree models for density ratio estimation along with efficient algorithms using a new loss function, the balancing loss. The loss allows tree-based models to be trained using several algorithms originally designed for supervised learning, such as forward-stagewise optimization and gradient boosting. Moreover, the balancing loss resembles an exponential family kernel, and it can serve as a pseudo-likelihood with conjugate priors. This property enables generalized Bayesian inference on the density ratio using backfitting samplers designed for Bayesian additive regression trees (BART). Our Bayesian strategy provides uncertainty quantification for the inferred density ratio, which is critical for applications involving high-dimensional and data-limited distributions with potentially substantial uncertainty. We further show connections of the balancing loss to the exponential loss in binary classification and to the variational form of f-divergence, particularly the squared Hellinger distance. Numerical experiments demonstrate that our method achieves both accuracy and computational efficiency, while uniquely providing uncertainty quantification. Finally, we demonstrate its application to assessing the quality of generative models for microbiome compositional data.