The influence of missing data mechanisms and simple missing data handling techniques on fairness
arXiv:2503.07313v2 Announce Type: replace
Abstract: Machine learning algorithms permeate the day-to-day aspects of our lives, so studying the fairness of these algorithms before deployment is crucial. One way in which bias can manifest in a dataset is through missing values. Missing data are often assumed to be missing completely at random; in reality, the propensity of data being missing is often tied to the demographic characteristics of individuals. There is limited research into how missing values, and the handling thereof, can impact the fairness of an algorithm. Most researchers either apply listwise deletion or use simpler imputation methods (e.g. mean or mode imputation) rather than more advanced approaches (e.g. multiple imputation). This study considers the fairness of various classification algorithms after a range of missing data handling strategies is applied. Missing values are generated (i.e. amputed) in three popular benchmark datasets for classification fairness, creating a high percentage of missing values under each of three missing data mechanisms. The results show that the missing data mechanism does not significantly impact fairness; across the missing data handling techniques, listwise deletion gives the highest fairness on average, and amongst the classification algorithms, random forests lead to the highest fairness on average. The interaction effect between the missing data handling technique and the classification algorithm is also often significant.
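The pipeline the abstract describes — ampute data under a mechanism tied to a demographic attribute, apply a missing data handling technique, train a classifier, and measure fairness — can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic data, the MAR missingness probabilities, the choice of demographic parity as the fairness metric, and the two handling strategies shown are all assumptions for the sake of a runnable example.

```python
# Sketch (assumed, not the paper's implementation): MAR amputation tied to a
# sensitive attribute, then two simple handling strategies compared on fairness.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: a binary sensitive attribute and two features.
sensitive = rng.integers(0, 2, n)
x1 = rng.normal(size=n) + 0.5 * sensitive
x2 = rng.normal(size=n)
y = (x1 + x2 + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

# MAR amputation: the propensity of x2 being missing depends on the
# sensitive attribute (40% missing in group 1, 10% in group 0).
miss_prob = np.where(sensitive == 1, 0.4, 0.1)
x2_amputed = np.where(rng.random(n) < miss_prob, np.nan, x2)
X = np.column_stack([x1, x2_amputed])

def demographic_parity_diff(y_pred, group):
    """Absolute difference in positive prediction rates between groups."""
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())

# Handling strategy 1: listwise deletion (complete cases only).
keep = ~np.isnan(X).any(axis=1)
rf_del = RandomForestClassifier(random_state=0).fit(X[keep], y[keep])
dpd_del = demographic_parity_diff(rf_del.predict(X[keep]), sensitive[keep])

# Handling strategy 2: mean imputation (a "simpler method" per the abstract;
# a full study would also compare e.g. multiple imputation and held-out data).
X_imp = SimpleImputer(strategy="mean").fit_transform(X)
rf_imp = RandomForestClassifier(random_state=0).fit(X_imp, y)
dpd_imp = demographic_parity_diff(rf_imp.predict(X_imp), sensitive)

print(f"demographic parity diff | deletion: {dpd_del:.3f}, "
      f"mean imputation: {dpd_imp:.3f}")
```

Note how the MAR mechanism makes complete cases unrepresentative of group 1, which is exactly why the interaction between handling technique and classifier can matter for fairness.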