A Linear Algebraic Proof of the Gauss-Markov Theorem Under Generalized Conditions: Theory and Empirical Application to Clustered Clinical Data
The Gauss-Markov theorem is central to linear statistical inference: it guarantees that Ordinary Least Squares (OLS) is the Best Linear Unbiased Estimator (BLUE) under the classical assumptions. Textbook proofs, however, typically rely on two assumptions that are often violated in practice: full column rank of the design matrix and spherical error covariance. In this paper, we give a unified proof that covers both rank deficiency and non-spherical errors, using tools from linear algebra: orthogonal projections, the Moore-Penrose pseudoinverse, and the Loewner (positive semi-definite) matrix order. We relax the two assumptions in turn, culminating in a single theorem that allows rank-deficient design matrices and arbitrary positive definite error covariances. The proof rests on three lemmas: (1) the OLS fitted values are the unique orthogonal projection of the response onto the column space of the design matrix; (2) the covariance of any alternative linear unbiased estimator exceeds that of OLS in the positive semi-definite order; and (3) Generalized Least Squares is BLUE under non-spherical errors, by a Cholesky decomposition argument. To bridge matrix theory and practical biostatistics, we apply this framework to baseline clinical data from the WHELD dementia study, demonstrating how the pseudoinverse resolves exact multicollinearity and why robust estimators are needed when intra-class correlation makes the errors non-spherical. The framework serves as both a formal reference for researchers and an illustrative teaching tool.
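The three lemmas can be checked numerically on a toy problem. The following is a minimal NumPy sketch, not the paper's own code: the data are synthetic (not the WHELD data), and the helper names `ols_pinv` and `gls_cholesky`, the exchangeable within-cluster covariance with `rho = 0.5`, the unit error variance, and the `rcond` tolerance are all illustrative assumptions.

```python
import numpy as np

RCOND = 1e-10  # assumed cutoff for treating singular values as zero


def ols_pinv(X, y):
    """Minimum-norm least-squares coefficients via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(X, rcond=RCOND) @ y


def gls_cholesky(X, y, Sigma):
    """GLS by whitening: factor Sigma = L L^T, then apply OLS to L^{-1} X and L^{-1} y."""
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    return np.linalg.pinv(Xw, rcond=RCOND) @ yw


rng = np.random.default_rng(0)
n = 12
x1 = rng.normal(size=n)
# Third column is exactly twice the second, so X has rank 2 (exact multicollinearity).
X = np.column_stack([np.ones(n), x1, 2.0 * x1])
y = rng.normal(size=n)

# Lemma 1: fitted values are the orthogonal projection of y onto col(X).
beta_ols = ols_pinv(X, y)
P = X @ np.linalg.pinv(X, rcond=RCOND)  # projector onto col(X)
fitted = X @ beta_ols

# Lemma 2 (fitted-value form, spherical errors with unit variance assumed):
# H = P + M (I - P) satisfies H X = X, so H y is linear and unbiased for X beta;
# Cov(H y) - Cov(P y) should be positive semi-definite.
M = rng.normal(size=(n, n))
H = P + M @ (np.eye(n) - P)
diff = H @ H.T - P @ P.T
min_eig = np.linalg.eigvalsh((diff + diff.T) / 2).min()

# Lemma 3: GLS under an exchangeable covariance with 4 clusters of size 3.
rho = 0.5
block = (1 - rho) * np.eye(3) + rho * np.ones((3, 3))
Sigma = np.kron(np.eye(4), block)
beta_gls = gls_cholesky(X, y, Sigma)
# Residual of the GLS normal equations: X^T Sigma^{-1} (y - X beta_gls).
gls_score = X.T @ np.linalg.solve(Sigma, y - X @ beta_gls)
```

Because `X` is rank deficient, the coefficient vectors are not unique; the pseudoinverse selects the minimum-norm solution, while the fitted values `X @ beta_ols` are the same for every least-squares solution.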