Multiple Imputation Using Pretrained Foundation Models: A TabPFN-Based Approach

Handling missing data is a central challenge in quantitative research, particularly when datasets exhibit complex dependency structures, such as nonlinear relationships and interactions. Multiple imputation (MI) via fully conditional specification (FCS), as implemented in the mice R package, is widely used, but it relies either on user-specified conditional models, which may fail to capture complex dependency structures, especially in high-dimensional settings, or on more flexible algorithms that tend to be data-hungry. This paper investigates the performance of TabPFN, a transformer-based, pretrained foundation model developed for tabular prediction tasks, for MI. TabPFN is pretrained on millions of synthetic datasets and approximates posterior predictive distributions without dataset-specific retraining, offering a compelling solution for imputing complex missing data in small to moderately sized samples. We conduct a simulation study focusing on univariate missingness in a continuous outcome, comparing TabPFN with standard MI methods. Performance is evaluated using bias, standard error, and coverage of the marginal mean estimand across a range of data-generating and missingness mechanisms. Our results show that TabPFN yields competitive or superior performance relative to classification and regression trees and predictive mean matching. These findings highlight TabPFN as a promising new tool for missing data imputation, with particular relevance to health research.
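To make the MI workflow described above concrete, the following minimal sketch simulates univariate missingness in a continuous outcome, draws multiple imputations, and pools estimates of the marginal mean with Rubin's rules. It uses only the Python standard library: a bootstrapped stochastic-regression imputer stands in for TabPFN's approximate posterior predictive draws (in practice one would replace `fit_ols` and the normal draw with a fitted TabPFN regressor), and the helper names, sample size, and missingness rate are illustrative assumptions, not taken from the paper's simulation design.

```python
import random
import statistics

random.seed(1)

# Simulate a dataset with univariate missingness in a continuous outcome y.
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]  # true marginal mean of y is 2.0

# MCAR mechanism: delete roughly 30% of y at random.
missing = [random.random() < 0.3 for _ in range(n)]
obs = [(xi, yi) for xi, yi, mi in zip(x, y, missing) if not mi]

def fit_ols(pairs):
    """Simple OLS of y on x; returns (intercept, slope, residual sd).
    Stand-in for a flexible imputation model such as TabPFN."""
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    resid = [b - (b0 + b1 * a) for a, b in zip(xs, ys)]
    return b0, b1, statistics.stdev(resid)

m_imputations = 20
estimates, variances = [], []
for _ in range(m_imputations):
    # Bootstrap the observed cases so each imputation reflects parameter
    # uncertainty, then draw imputations from the fitted predictive distribution.
    boot = [random.choice(obs) for _ in range(len(obs))]
    b0, b1, sd = fit_ols(boot)
    y_imp = [yi if not mi else b0 + b1 * xi + random.gauss(0, sd)
             for xi, yi, mi in zip(x, y, missing)]
    estimates.append(statistics.fmean(y_imp))        # estimand: marginal mean of y
    variances.append(statistics.variance(y_imp) / n)  # its estimated sampling variance

# Rubin's rules: pool the m completed-data analyses.
q_bar = statistics.fmean(estimates)                  # pooled point estimate
w = statistics.fmean(variances)                      # within-imputation variance
b = statistics.variance(estimates)                   # between-imputation variance
t = w + (1 + 1 / m_imputations) * b                  # total variance
print(f"pooled mean = {q_bar:.3f}, total SE = {t ** 0.5:.3f}")
```

The bias and coverage metrics in the simulation study are computed from exactly these pooled quantities, repeated over many simulated datasets; swapping the imputation model changes only the step that produces `y_imp`.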
