Linear Regression with Unknown Truncation Beyond Gaussian Features
arXiv:2602.12534v1 Announce Type: new
Abstract: In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$, and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning, going back to the works of (Galton, 1897; Tobin, 1958) and continuing more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly}(1/\varepsilon)}$ runtime for achieving $\varepsilon$ accuracy.
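The observation model described above can be illustrated with a small simulation. This is a toy sketch only: the survival set, dimensions, and noise level below are invented for illustration, and the naive least-squares fit at the end is shown merely to emphasize that ordinary regression on truncated data is biased, which is what motivates specialized algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10000
w_star = rng.normal(size=d)          # unknown regressor to be estimated
S_star = (0.0, 2.0)                  # hypothetical survival set: y observed iff y in S_star

X = rng.normal(size=(n, d))          # sub-Gaussian (here Gaussian) feature vectors
y = X @ w_star + rng.normal(size=n)  # noisy linear responses

# Truncation: samples whose outcome falls outside S_star are never shown.
mask = (y >= S_star[0]) & (y <= S_star[1])
X_obs, y_obs = X[mask], y[mask]      # only these pairs reach the learner

# Naive least squares on the surviving sample is a biased estimate of w_star.
w_naive = np.linalg.lstsq(X_obs, y_obs, rcond=None)[0]
```

Running this, `w_naive` systematically deviates from `w_star` because the surviving sample is selected based on the outcome $y$ itself.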
In this work, we give the first algorithm for truncated linear regression with an unknown survival set that runs in $\mathrm{poly}(d/\varepsilon)$ time, requiring only that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples), under a certain smoothness condition. This learning guarantee adds to the line of work on positive-only PAC learning and may be of independent interest.
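To convey the flavor of the positive-only interval-learning task, the sketch below covers a one-dimensional sample of positive examples with at most $k$ intervals by cutting the sorted sample at its $k-1$ largest gaps. This is a simple illustrative heuristic, not the paper's subroutine, and the function name and parameters are made up here.

```python
import numpy as np

def intervals_from_positives(samples, k):
    """Cover positive examples with at most k intervals by splitting the
    sorted sample at its k-1 largest gaps (illustrative heuristic only)."""
    s = np.sort(np.asarray(samples, dtype=float))
    if k <= 1 or len(s) <= 1:
        return [(s[0], s[-1])]
    gaps = np.diff(s)                              # distance between consecutive points
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])    # positions of the k-1 widest gaps
    pieces = np.split(s, cuts + 1)                 # split the sample at those gaps
    return [(p[0], p[-1]) for p in pieces if len(p)]

# Positive examples drawn from a union of two disjoint intervals
rng = np.random.default_rng(1)
pos = np.concatenate([rng.uniform(0, 1, 200), rng.uniform(3, 4, 200)])
print(intervals_from_positives(pos, k=2))
```

With only positive examples available, the learner can never observe where the survival set ends, which is why some regularity assumption (such as the smoothness condition mentioned above) is needed to pin down the interval boundaries.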