Essential Python Libraries for Data Science
Part 3: Classical Machine Learning

In Part 1, we focused on how data is represented, transformed, and computed using NumPy and Pandas. By the end of that part, the dataset was clean, structured, and numerically stable.
In Part 2, we resisted the urge to jump straight into modeling. Instead, we validated assumptions through visualization and diagnostics. We inspected distributions, examined feature relationships, identified correlation and redundancy, and documented constraints that would influence downstream decisions.
At this point, the data is not just prepared. It is understood.
Only now does it make sense to introduce machine learning.
This is an important distinction. Classical machine learning does not add intelligence to raw data. It formalizes patterns that already exist. When modeling is introduced too early, it amplifies noise, hidden bias, and data quality issues. When introduced at the right time, it becomes a powerful and reliable decision-making layer.
This third part continues in the same notebook, using the same normalized and diagnosed dataset from Parts 1 and 2. No data is reloaded. No features are redefined. The diagnostic observations already recorded will now influence how models are built, evaluated, and interpreted.
The focus here is classical machine learning for structured, tabular data, using scikit-learn. Not from a competition perspective, and not as a catalog of algorithms, but as a systematic modeling process that prioritizes stability, interpretability, and reproducibility.
We will start with simple, well-understood baselines, build pipelines that reflect real workflows, and evaluate models in a way that aligns with production constraints rather than leaderboard metrics.
By the end of this part, you will not just have trained models. You will have a modeling approach that fits naturally into the data pipeline we have built so far.
In this part, we will:
- Split data correctly while preserving assumptions
- Build preprocessing and modeling pipelines
- Train baseline models for structured data
- Evaluate performance using appropriate metrics
- Establish a foundation for more advanced models in later parts
Each step will extend the same end-to-end example, without resetting or reshaping the workflow.
Transition to Modeling
With data prepared in Part 1 and validated in Part 2, the next step is to introduce machine learning in the most conservative and reliable way possible.
We begin by defining how data flows into a model.
Step 11: Train–Test Split — Defining the Boundary Between Learning and Evaluation
Once data has been prepared and validated, the first modeling decision is not which algorithm to use. It is how to separate what the model is allowed to learn from what it will be evaluated on.
This boundary is critical. A poorly designed train–test split introduces leakage, inflates performance metrics, and creates false confidence that only collapses later in production. A well-designed split enforces discipline and makes model evaluation meaningful.
In real systems, models are always evaluated on future or unseen data. The train–test split is the simplest approximation of that reality.
Because the dataset we are working with is already normalized and diagnostically validated, this step is not about fixing data issues. It is about preserving assumptions and ensuring that evaluation reflects how the model will actually be used.
Why Train–Test Splitting Comes Before Modeling
Many beginners treat the train–test split as a mechanical step. In production environments, it is a design decision.
Key considerations include:
- Preventing information leakage
- Ensuring reproducibility
- Maintaining alignment with downstream evaluation
- Establishing a stable baseline for comparison
Once this split is defined, everything that follows must respect it.
Step 11.1: Defining Features and Target
We continue using the same feature set and target defined earlier. Nothing is reloaded and nothing is redefined.
X = X_normalized_df  # normalized features carried over from Part 1
y = df["target"]     # classification target validated in Part 2
At this stage:
- X represents normalized numerical features
- y represents the classification target
- Both are already validated and aligned
Step 11.2: Performing the Train–Test Split
We now split the dataset into training and testing subsets. The test set represents unseen data and must remain untouched until evaluation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
Key points about this split:
- test_size=0.2 reserves 20% of data for evaluation
- random_state=42 ensures reproducibility
- stratify=y preserves class distribution, which is essential for classification problems
Stratification is particularly important in real-world datasets where class imbalance is common. Without it, evaluation metrics can become misleading.
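The effect of stratification can be checked directly by comparing class proportions before and after the split. The sketch below uses a synthetic imbalanced dataset generated with `make_classification` as a stand-in for the article's `X` and `y`, which are not reproduced here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the article's normalized features and target,
# with a deliberate 70/30 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, class proportions stay nearly identical across
# the full dataset, the training set, and the test set
print(np.bincount(y) / len(y))
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```

Running the same comparison without `stratify=y` would show the test-set proportions drifting with the random draw, which is exactly the distortion stratification prevents.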
What This Split Guarantees
At this point, we have enforced several important constraints:
- The model will only learn from X_train and y_train
- Evaluation will be performed only on X_test and y_test
- No diagnostic or preprocessing logic will peek into the test set
- Performance metrics will reflect generalization, not memorization
These guarantees are more important than the choice of algorithm that follows.
Why This Matters in Production Systems
In production, models rarely fail because they are mathematically incorrect. They fail because evaluation was optimistic, assumptions were violated, or leakage went unnoticed.
A clean train–test split is the first safeguard against these failures. It establishes trust in every metric computed afterward and provides a stable foundation for comparing models as the system evolves.
Transition to the Next Step
With the learning and evaluation boundary clearly defined, we can now introduce models in a controlled and interpretable way.
Step 12: Baseline Models — Establishing Reference Performance
Once the train–test boundary has been defined, the next step is not to search for the most powerful algorithm. It is to establish baselines.
Baseline models serve a specific purpose in production-grade data science. They answer a simple but critical question:
What level of performance can we achieve with the simplest, most interpretable assumptions?
Without baselines, improvements cannot be measured meaningfully. More complex models may appear to perform well, but without a reference point, it is impossible to know whether that performance is real or accidental.
In structured, tabular data problems, linear models are often the most informative place to start.
Why Baseline Models Matter
Baseline models provide several important guarantees:
- They are fast to train and evaluate
- Their behavior is easy to interpret
- Their limitations are well understood
- They expose data issues early
- They create a stable benchmark for comparison
If a sophisticated model cannot significantly outperform a well-chosen baseline, the additional complexity is rarely justified.
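One simple way to make "a well-chosen baseline" concrete, not shown in the original text but available in scikit-learn, is a majority-class `DummyClassifier`: it defines the accuracy floor that any real model must clear. The data here is a synthetic stand-in, not the article's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a 70/30 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Always predicts the majority class; its accuracy equals the
# majority-class proportion and sets the floor to beat
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
floor = dummy.score(X_test, y_test)
print(f"Majority-class accuracy floor: {floor:.2f}")
```

On an imbalanced dataset this floor can be deceptively high, which is one more reason accuracy alone is never enough.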
Step 12.1: Logistic Regression as a Baseline Classifier
Because this is a binary classification problem, Logistic Regression is a natural starting point. Despite its simplicity, it performs surprisingly well on many real-world datasets and provides clear signals about feature relevance.
We will use scikit-learn’s implementation with default settings, keeping the model intentionally simple.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(
    max_iter=1000,
    random_state=42
)
log_reg.fit(X_train, y_train)

At this stage:
- The model is trained only on training data
- No hyperparameter tuning is performed
- The goal is reference performance, not optimization
Step 12.2: Evaluating the Baseline Model
Evaluation must be performed strictly on the test set. Any evaluation on training data would overstate performance and undermine the purpose of the baseline.
from sklearn.metrics import accuracy_score
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

Accuracy provides a quick sanity check, but it should never be the only metric considered. It answers how often predictions are correct, not how or why they fail.
Step 12.3: Confusion Matrix for Error Structure
To understand model behavior more deeply, we examine the confusion matrix. This reveals the types of errors the model makes and whether those errors are symmetric.
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

This matrix highlights:
- False positives vs false negatives
- Whether one class is systematically harder to predict
- Potential business implications of misclassification
These patterns often matter more than raw accuracy.
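For binary problems, the four cells of the confusion matrix can be unpacked by name with `.ravel()`, which makes the false-positive/false-negative distinction explicit. This sketch trains its own small model on synthetic stand-in data rather than reusing the article's variables:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the article's train/test split
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels {0, 1}, ravel() flattens the 2x2 matrix
# in the documented order: tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```

Having the counts as named scalars makes it easy to compute exactly the quantity a business cares about, such as the false-negative rate, without re-deriving it from the matrix layout.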
Step 12.4: Classification Report for Detailed Metrics
Finally, we examine precision, recall, and F1-score to understand performance per class.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

This report provides a more complete picture of model behavior and often exposes trade-offs hidden by aggregate metrics.
What This Baseline Tells Us
At this point, we have established a clear reference point:
- Performance achieved with minimal assumptions
- Error patterns that reflect data structure
- A model that is easy to interpret and debug
This baseline becomes the yardstick against which all future models will be measured.
Why This Step Is Critical in Production
In production systems, baseline models are often retained even after more advanced models are deployed. They serve as:
- Fallback options
- Sanity checks during retraining
- Reference points during model drift analysis
Skipping baselines removes this safety net.
Transition to the Next Step
With a baseline model in place, we can now focus on structure rather than algorithms.
In the next step, we will introduce scikit-learn pipelines to formalize preprocessing and modeling as a single, reproducible unit.
Step 13: Modeling Pipelines — Structuring Training and Evaluation
Once a baseline model is established, the next priority is not improving performance. It is improving structure.
In exploratory work, it is common to see preprocessing steps applied manually before model training. In production systems, this approach quickly becomes fragile. Transformations drift, evaluation becomes inconsistent, and retraining pipelines break in subtle ways.
Modeling pipelines exist to prevent this.
A pipeline enforces a single, repeatable path from raw input to prediction. Every transformation applied during training is guaranteed to be applied in the same way during evaluation and inference. This consistency is what turns a model into a system component rather than a one-off experiment.
Why Pipelines Matter More Than Algorithms
In real deployments, models fail far more often due to process issues than algorithmic limitations.
Pipelines address several recurring problems:
- Inconsistent preprocessing between training and inference
- Accidental leakage from test data
- Difficulty reproducing results
- Fragile retraining workflows
Once pipelines are in place, models become easier to reason about, easier to validate, and easier to operate over time.
Step 13.1: Introducing the scikit-learn Pipeline
scikit-learn provides a native Pipeline abstraction that allows preprocessing and modeling steps to be chained together into a single object.
Even when preprocessing is minimal, using a pipeline early establishes good discipline.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
Step 13.2: Defining a Simple Modeling Pipeline
In our case, the data is already numerically normalized. That does not eliminate the need for a pipeline. It simply means the pipeline is currently lightweight.
pipeline = Pipeline(
    steps=[
        ("model", LogisticRegression(
            max_iter=1000,
            random_state=42
        ))
    ]
)
This pipeline explicitly defines:
- A single modeling step
- All model configuration in one place
- A reusable object that encapsulates training logic
As the system evolves, additional steps can be added without changing downstream code.
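To illustrate that point, here is a sketch (on synthetic stand-in data, not the article's dataset) of the same pipeline with a `StandardScaler` step added. The calling code is untouched, and the scaler is fit only on training data because it lives inside the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the article's data
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Adding a preprocessing step changes the pipeline definition,
# not the downstream fit/predict code
pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000, random_state=42)),
    ]
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```

The scaler's statistics are computed during `fit(X_train, y_train)` and merely applied at prediction time, which is precisely the leakage guarantee the text describes.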
Step 13.3: Training the Pipeline
Training the pipeline looks identical to training a model, but the semantics are different. The pipeline now owns the full transformation and modeling process.
pipeline.fit(X_train, y_train)

This single call ensures that:
- All steps are fit only on training data
- No accidental leakage occurs
- The training process is reproducible
Step 13.4: Evaluating the Pipeline
Evaluation also remains unchanged at the surface level, which is a key advantage of pipelines.
y_pred_pipeline = pipeline.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_pipeline)

Because the pipeline encapsulates all logic, the same object can later be used for inference without modification.
Step 13.5: Why Pipelines Scale Well
As systems grow, pipelines become even more valuable. They support:
- Adding preprocessing steps safely
- Swapping models without refactoring code
- Cross-validation without leakage
- Consistent retraining and deployment
In regulated or long-lived systems, pipelines are often a requirement rather than a convenience.
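The "cross-validation without leakage" point above follows directly from passing the whole pipeline to `cross_val_score`: every fold re-fits every step, including preprocessing, on that fold's training portion only. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the article's dataset
X, y = make_classification(n_samples=1000, random_state=42)

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000, random_state=42)),
    ]
)

# Each of the 5 folds re-fits scaler and model on its training portion,
# so no statistics from the held-out fold leak into preprocessing
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Scaling the full dataset first and cross-validating afterwards would quietly leak test-fold statistics into training; the pipeline form makes that mistake structurally impossible.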
Pipelines as System Boundaries
A useful way to think about pipelines is as contract boundaries. Everything inside the pipeline is part of the learning system. Everything outside is orchestration.
This separation simplifies:
- Auditing
- Debugging
- Monitoring
- Change management
It also makes the system more resilient to future changes.
Transition to the Next Step
With pipelines in place, we can now evaluate models more rigorously and compare alternatives without changing process.
In the next step, we will focus on model evaluation beyond accuracy, examining metrics that reveal error structure and business impact.
Step 14: Model Evaluation — Metrics That Actually Matter
Once a model has been trained through a clean pipeline, the next question is not whether it achieves a high score, but what that score actually means.
In many projects, evaluation stops at a single metric, often accuracy. In production systems, this is rarely sufficient. A model can achieve high accuracy while still making costly or unacceptable errors. Evaluation must therefore expose error structure, not just aggregate performance.
This step focuses on understanding how the model fails, not just how often it succeeds.
Why Accuracy Alone Is Not Enough
Accuracy answers one question: What fraction of predictions were correct?
It does not answer:
- Which class is harder to predict
- Whether false positives or false negatives dominate
- How performance varies across decision thresholds
- Whether errors align with business risk
In regulated or high-impact systems, these distinctions matter more than raw accuracy.
Step 14.1: Reusing Pipeline Predictions
We continue evaluating the pipeline model introduced in Step 13. No new models are trained and no parameters are changed.
y_pred = pipeline.predict(X_test)
This ensures that evaluation reflects the exact system that would be deployed.
Step 14.2: Confusion Matrix — Understanding Error Types
The confusion matrix is one of the most important diagnostic tools in classification. It shows how predictions are distributed across true classes.
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

This matrix reveals:
- How often the model confuses one class for another
- Whether errors are balanced or skewed
- Which mistakes are most common
In real systems, these patterns often map directly to business outcomes.
Step 14.3: Precision, Recall, and F1-Score
To go beyond raw counts, we examine class-level performance using precision, recall, and F1-score.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

These metrics provide deeper insight:
- Precision measures how reliable positive predictions are
- Recall measures how well the model captures actual positives
- F1-score balances the two
Different applications prioritize these metrics differently. Understanding the trade-offs is more important than optimizing any single value.
Step 14.4: Probability Scores and Decision Thresholds
Many models, including logistic regression, produce probability estimates rather than hard class labels. These probabilities allow decision thresholds to be adjusted based on risk tolerance.
y_proba = pipeline.predict_proba(X_test)[:, 1]
Working with probabilities enables:
- Threshold tuning
- Risk-based decisioning
- Better alignment with business constraints
This is especially important when the cost of different error types is asymmetric.
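A minimal sketch of threshold adjustment, using synthetic imbalanced stand-in data and an illustrative threshold of 0.3 (both assumptions, not values from the article): lowering the threshold below the default 0.5 can only add predicted positives, so recall never decreases, at the cost of precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with an 80/20 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Compare the default 0.5 threshold with a lower, recall-favouring one
for threshold in (0.5, 0.3):
    y_pred = (y_proba >= threshold).astype(int)
    print(threshold, recall_score(y_test, y_pred))
```

The right threshold is a business decision, not a modeling one: it should be chosen from the relative cost of false positives versus false negatives.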
Step 14.5: ROC Curve and AUC
The ROC curve visualizes the trade-off between true positive rate and false positive rate across thresholds.
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = roc_auc_score(y_test, y_proba)
roc_auc

AUC provides a threshold-independent measure of separability. While it should not be used in isolation, it is useful for comparing models under consistent conditions.
Interpreting Evaluation Results in Context
Evaluation metrics do not exist in a vacuum. Their meaning depends on:
- Data distribution
- Class imbalance
- Business risk
- Regulatory constraints
A model that is statistically strong but operationally risky is not production-ready. Evaluation is where these tensions surface.
Why This Step Matters in Production
In production systems:
- Metrics guide deployment decisions
- Metrics define alert thresholds
- Metrics drive retraining schedules
- Metrics support audits and reviews
Poor evaluation leads to fragile systems, regardless of model sophistication.
Transition to the Next Step
At this point, we have:
- A validated dataset
- A reproducible pipeline
- A baseline model
- A meaningful evaluation framework
In the next step, we will bring these elements together and discuss model selection, comparison, and readiness for more advanced approaches, setting the stage for Part 4.
Step 15: Model Comparison and Readiness — Knowing When to Move Forward
By this point, we have done everything that classical machine learning requires to be credible.
We did not rush into complex algorithms. We built a clean data foundation, validated assumptions, enforced strict training boundaries, established baselines, structured the workflow with pipelines, and evaluated models using metrics that expose real behavior.
Now comes a decision that many projects get wrong: whether to move forward, and how.
Model comparison is not about picking the highest score. It is about deciding whether additional complexity is justified, safe, and aligned with system constraints.
What We Are Comparing Against
Our baseline model provides a reference point with several important properties:
- It is interpretable
- It is stable
- It is fast to train and retrain
- Its failure modes are easy to understand
- Its performance is measurable and reproducible
Any model that follows must justify itself relative to this baseline, not in isolation.
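A minimal sketch of what that comparison can look like in code, assuming synthetic stand-in data and a `RandomForestClassifier` as an illustrative candidate (the article does not prescribe this model). The essential discipline is that baseline and candidate are scored under identical cross-validation conditions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the article's dataset
X, y = make_classification(n_samples=1000, random_state=42)

baseline = LogisticRegression(max_iter=1000, random_state=42)
candidate = RandomForestClassifier(n_estimators=100, random_state=42)

# Score both under the same folds, metric, and data,
# so any difference reflects the models, not the setup
for name, model in [("baseline", baseline), ("candidate", candidate)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the candidate's mean score sits within the baseline's fold-to-fold spread, the extra complexity has not yet justified itself.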
When a Baseline Is “Good Enough”
In many real-world systems, baseline models remain in production for years. This is not a failure of ambition. It is a reflection of trade-offs.
A baseline may be sufficient when:
- Performance meets business requirements
- Errors are well understood and acceptable
- Latency and throughput constraints are tight
- Explainability is mandatory
- Retraining and monitoring need to be simple
In these cases, complexity adds risk without meaningful upside.
When It Makes Sense to Go Further
More advanced models become relevant when:
- Baseline performance is clearly insufficient
- Error patterns indicate nonlinear structure
- Feature interactions matter
- Business impact justifies additional complexity
- Monitoring and governance capabilities are in place
This decision should be driven by evidence, not by algorithm popularity.
Readiness Checklist Before Moving On
Before introducing more sophisticated models, the system should satisfy the following:
- A stable baseline exists
- Evaluation metrics are trusted
- Error behavior is understood
- Pipelines are in place
- Retraining is reproducible
- Monitoring requirements are clear
If any of these are missing, adding complexity will amplify existing weaknesses.
Why This Step Matters More Than Any Algorithm
Many production failures happen not because models are weak, but because teams move forward prematurely. They add complexity before establishing control.
This step enforces restraint.
It ensures that progress is deliberate and reversible, not reactive.
Closing Thoughts for Part 3
Classical machine learning is not outdated. It remains the backbone of most production systems built on structured, tabular data. What determines success is not the algorithm itself, but the process around it: disciplined data handling, careful evaluation, and awareness of real operational constraints.
By the end of this part, we have not simply trained models. We have established a modeling approach that is stable, interpretable, and designed to evolve safely as systems grow in complexity. This foundation is what allows teams to move forward with confidence rather than chasing incremental gains without control.
If this way of thinking reflects how you approach data science in real systems, consider clapping to signal that this perspective was useful, leaving a comment to share how you handle baselines and evaluation in your own work, or following the series to continue from classical modeling into more advanced production techniques. Sharing the article with others working on tabular ML systems also helps extend the discussion beyond a single post. Thanks!
Transition to Part 4
With a solid classical modeling foundation in place, we are now ready to explore gradient boosting in production.
In Part 4, we will examine why models such as XGBoost, LightGBM, and CatBoost dominate tabular machine learning, and how they are tuned, evaluated, and governed in real-world systems where performance, explainability, and stability all matter.
Essential Python Libraries for Data Science was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.