Enhancing Student Retention in Higher Education Institutions (HEIs): A Machine Learning Approach
Student dropout remains a critical challenge for higher education institutions, with significant implications for resource allocation, academic planning, and institutional sustainability. This study applies machine learning techniques to predict student non-continuation and attrition, with the objective of supporting data-driven retention strategies. A structured analytical pipeline was applied to a publicly available higher education student dataset (4,424 records, 34 features, multi-class outcome), incorporating Winsorization for outlier mitigation, SMOTE for class imbalance handling, and targeted feature engineering. Model performance was assessed using 5-fold nested cross-validation. Four classifiers (Extra Trees, Random Forest, Gradient Boosting, and Logistic Regression) were trained on an optimized subset of 28 features. Among these, the Extra Trees model achieved the strongest performance, attaining a mean AUC of 0.96 (±0.0053) and an accuracy of 87.4% (±0.012). Model interpretability was enhanced through SHAP analysis, which identified cumulative approved academic units and tuition fee payment status as the most influential predictors of student outcomes. The findings underscore the value of early predictive analytics for informing proactive institutional interventions, particularly in academic monitoring and financial support, to strengthen student retention frameworks.
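The evaluation design described above can be illustrated with a minimal sketch; it is not the study's actual code. It assumes scikit-learn and NumPy, uses synthetic data in place of the student dataset, and introduces a hypothetical `Winsorizer` transformer so that percentile bounds are learned only from each training fold. The 1st/99th percentile limits, the 3-fold inner loop, the small hyperparameter grid, and one-vs-rest AUC for the multi-class outcome are all assumptions. SMOTE is replaced here by class weighting, purely to keep the sketch dependent on scikit-learn alone; an oversampling step would sit inside the pipeline at the same position so that it is also fitted per fold.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


class Winsorizer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: clip each feature to percentile bounds
    learned from the training data only."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-feature lower and upper bounds.
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


# Synthetic stand-in data (assumption): 3 imbalanced classes, 28 features.
X, y = make_classification(
    n_samples=600, n_features=28, n_informative=10, n_classes=3,
    weights=[0.5, 0.3, 0.2], random_state=0,
)

pipeline = Pipeline([
    ("winsorize", Winsorizer(lower=1.0, upper=99.0)),
    # class_weight="balanced" is a stand-in for SMOTE, not the study's method.
    ("model", ExtraTreesClassifier(
        n_estimators=50, class_weight="balanced", random_state=0)),
])

# Inner loop tunes hyperparameters; outer loop estimates generalization.
param_grid = {"model__max_depth": [None, 10]}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="roc_auc_ovr")
outer_auc = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc_ovr")

print(f"nested CV AUC: {outer_auc.mean():.3f} (+/- {outer_auc.std():.3f})")
```

Keeping the preprocessing inside the pipeline means that neither the winsorization bounds nor the tuned hyperparameters are influenced by the held-out outer fold, which is the point of the nested design.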