Building Reliable Machine Learning Systems for Heart Disease Prediction
Author(s): Puspita Chowdhury. Originally published on Towards AI.

Image Source: https://www.technologynetworks.com/diagnostics/news/wealth-and-education-play-significant-role-in-heart-disease-risk-396976

Heart disease continues to be the leading cause of death worldwide, responsible for millions of deaths every year. Despite advances in clinical diagnostics, early and accurate detection remains a persistent challenge. Traditional diagnostic procedures are often invasive, expensive, and heavily dependent on physician interpretation. This is where Machine Learning (ML) offers a compelling alternative.

In this article, I present a comprehensive machine learning study that benchmarks Deep Learning, Stacking Ensembles, and a Multi-Level Hybrid Stacking Ensemble for heart disease prediction. The results demonstrate that carefully designed ensemble methods outperform standalone deep learning models, achieving state-of-the-art discrimination while maintaining robustness on real-world clinical data. This work is based on our full experimental report and implementation.

Why Heart Disease Prediction Is a Perfect ML Problem

Heart disease diagnosis relies on a combination of physiological indicators such as:

- Age
- Blood pressure
- Cholesterol levels
- Exercise-induced angina
- ECG results

These variables interact in nonlinear and complex ways. Humans struggle to reason about these interactions at scale, but machine learning models excel at detecting subtle patterns across multiple dimensions.

The goal of this project was not merely to train a classifier, but to answer a deeper question: which modeling strategy provides the most reliable, generalizable, and clinically useful predictions for heart disease risk?

Background and Related Research

The UCI Heart Disease dataset has been a benchmark in medical machine learning research for decades. Early studies, including the original work by Detrano et al., achieved approximately 77% accuracy using logistic regression.
Over time, more advanced models such as:

- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Random Forests
- Gradient Boosting

pushed performance into the 80–88% accuracy range. However, a persistent challenge remains in the literature: higher accuracy often comes at the cost of interpretability and stability, especially with deep neural networks. This project directly addresses that trade-off by comparing single deep learning models against ensemble-based strategies designed for tabular clinical data.

Dataset Overview: From Raw Clinical Records to ML-Ready Features

Data Source

I used a consolidated heart disease dataset assembled from four UCI repositories:

- Cleveland Clinic Foundation
- Hungarian Institute of Cardiology
- VA Medical Center (Long Beach)
- University Hospital, Zurich

This produced approximately 920 patient records, making the dataset more diverse and representative than any single source alone.

Target Definition

The original dataset labeled heart disease severity on a scale from 0 to 4. For clinical relevance and modeling clarity, I converted this into a binary classification task:

- 0 → No heart disease
- 1 → Presence of heart disease

Data Cleaning and Feature Engineering

Medical data is rarely clean, and this dataset was no exception.

Missing Values

Missing entries marked as "?" were replaced using median imputation, a robust choice that reduces sensitivity to outliers while preserving distributional stability.

Advanced Feature Engineering

To improve predictive signal, I introduced clinically meaningful composite features:

- Physiological ratios (e.g., cholesterol-to-max-heart-rate)
- Severity scores combining chest pain type and exercise-induced angina
- One-hot encoded categorical features

After engineering, the final dataset contained 24 informative features.
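The cleaning and engineering steps above can be sketched in a few lines of pandas. The column names follow the standard UCI heart-disease fields, but the specific ratio and severity-score formulas here are illustrative assumptions, not the exact definitions used in the report:

```python
import numpy as np
import pandas as pd

# Toy records in the UCI heart-disease column style; "?" marks missing values.
df = pd.DataFrame({
    "age":     [63, 54, 41],
    "chol":    [233, "?", 204],          # serum cholesterol (mg/dl)
    "thalach": [150, 122, 172],          # maximum heart rate achieved
    "cp":      [3, 1, 2],                # chest pain type (categorical)
    "exang":   [0, 1, 0],                # exercise-induced angina (0/1)
})

# 1) Replace "?" with NaN, then median-impute the affected numeric column.
df = df.replace("?", np.nan)
df["chol"] = pd.to_numeric(df["chol"])
df["chol"] = df["chol"].fillna(df["chol"].median())

# 2) Engineered composite features (hypothetical formulas for illustration).
df["chol_per_thalach"] = df["chol"] / df["thalach"]   # physiological ratio
df["severity_score"]   = df["cp"] + 2 * df["exang"]   # chest pain + angina

# 3) One-hot encode the categorical chest-pain type.
df = pd.get_dummies(df, columns=["cp"], prefix="cp")

print(df.shape)
```

The same pattern (impute, derive ratios and scores, one-hot encode) extends directly to the full 920-record table and its remaining categorical columns.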
Scaling and Transformation

- Standard Scaling ensured stable convergence for neural networks and linear models
- Quantile Transformation corrected skewed distributions (notably chol and oldpeak)

This preprocessing pipeline proved critical for model stability and generalization.

Model Architectures Compared

I evaluated three fundamentally different modeling strategies, each representing a different philosophy of machine learning.

Deep Learning Baseline: Multilayer Perceptron (MLP)

The deep learning model served as a strong single-model benchmark. Key characteristics:

- Feed-forward neural network implemented in PyTorch
- Funnel-shaped architecture compressing 24 features into abstract representations
- BCEWithLogitsLoss for numerical stability
- Adam optimizer with careful regularization

Despite extensive tuning (batch normalization, dropout, early stopping), the MLP showed signs of overfitting and instability, a common issue with tabular medical data.

Two-Level Stacking Ensemble (MLP Meta-Learner)

This architecture introduced ensemble learning while keeping complexity controlled.

Level 0: Base Models

- Random Forest
- Gradient Boosting
- XGBoost
- LightGBM

Each model captures different decision boundaries and inductive biases.

Out-of-Fold (OOF) Predictions

Using 5-fold cross-validation, base models generated predictions only on unseen folds, eliminating data leakage and ensuring true generalization signals.

Level 1: Meta-Learner

A compact MLP learned how to nonlinearly weight and combine the base model predictions. This architecture produced the best AUC-ROC score in the entire study.

Multi-Level Hybrid Stacking Ensemble

This was the most advanced and computationally expensive architecture.
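A minimal sketch of the two-level design using scikit-learn's StackingClassifier, whose internal cross-validation produces exactly the out-of-fold meta-features described above. To keep the example self-contained it uses only scikit-learn estimators (a subset of the report's four base models, with synthetic data standing in for the 24-feature clinical table) and a compact MLPClassifier as the Level 1 meta-learner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 24-feature clinical table.
X, y = make_classification(n_samples=600, n_features=24, n_informative=10,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

stack = StackingClassifier(
    estimators=[  # Level 0 base models (subset of the report's four)
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    # Level 1 meta-learner: a compact MLP that nonlinearly combines
    # the base models' predicted probabilities.
    final_estimator=MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                  random_state=42),
    stack_method="predict_proba",
    cv=5,  # 5-fold CV: meta-features come only from unseen folds (OOF)
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"held-out AUC-ROC: {auc:.3f}")
```

The `cv=5` argument is what prevents leakage: each base model's training-set predictions that the meta-learner sees were made on folds that model never fit on.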
Block Diagram of the Hybrid Stacked Ensemble

Level 0 (9 Base Models)

- Linear models (Logistic Regression, Ridge)
- Kernel- and distance-based models (SVM, KNN)
- Tree-based ensembles (RF, Extra Trees, GB, XGBoost, LightGBM)

Level 1 (Meta Models)

- Logistic Regression
- Ridge Classifier
- LightGBM

Each is trained on Level 0 OOF predictions.

Level 2 (Final Blender)

XGBoost, trained on a hybrid feature set combining Level 0 and Level 1 predictions. This design allows the final model to selectively trust both individual learners and refined blends.

How the Ensemble Learns: Model Diversity & Trust

This section goes beyond reporting performance metrics and aims to explain how the stacking ensemble arrives at its final predictions. Understanding the internal behavior of an ensemble is especially important in medical applications, where trust, robustness, and decision transparency matter as much as raw accuracy.

Fig-1: Diversity of the Level 0 base learners and how their individual performance contributes to the final ensemble result.

The first plot in Figure 1 presents the Out-of-Fold (OOF) AUC scores of the Level 0 base models, offering a clear view of model diversity within the ensemble. Each base learner represents a different learning strategy, ranging from linear and distance-based methods to advanced tree-based ensembles. Strong performers such as Random Forest, Gradient Boosting, XGBoost, and LightGBM consistently achieve high OOF AUC scores, indicating their ability to capture complex, non-linear relationships in clinical data. At the same time, simpler models contribute complementary signals that help reduce bias and improve overall generalization. This diversity is a key strength of the ensemble, as it prevents over-reliance on any single modeling assumption.

The second plot in Figure 1 focuses on the Level 1 meta-models, which are trained to combine the predictions generated by the Level […]