Data Leakage in Machine Learning: Why You Must Split Before Preprocessing

Why splitting your data after preprocessing can silently invalidate your entire evaluation

Imagine your model shows 95% accuracy during evaluation. You deploy it to production, and performance immediately drops to 60%. What went wrong?

In most cases, the culprit is Data Leakage. It occurs when information from the future (the test set) leaks into the past (the training process), giving the model a “cheat sheet” it won’t have in the real world. This mistake is surprisingly common in tutorials and beginner projects, and it can silently invalidate your results.

In this article, we’ll clarify:

  • Why you must split before preprocessing
  • What counts as learning from data
  • The correct ML workflow
  • Code examples (wrong vs correct)
  • How this applies to cross-validation

The Golden Rule: The Test Set Must Remain Unseen

The test set represents future unseen data. If any preprocessing step, such as scaling or imputation, uses information from the test set, your model indirectly “sees the future,” leading to inflated evaluation scores. This problem is known as Data Leakage — the use of information from outside the training dataset to train the model. Even simple preprocessing steps can cause leakage.

Think of the test set as tomorrow’s exam paper. You are allowed to study from past material (the training data), but you are not allowed to look at tomorrow’s paper while preparing. If you secretly look at the exam questions in advance, your score may look impressive, but it does not reflect real understanding. Data leakage works the same way. The model appears accurate, but only because it has already seen information it should not have.

What Counts as Learning From Data?

Any step that computes statistics or extracts patterns from the dataset is learning from it. This includes:

1. Data Scaling & Normalization.

To scale data, you must first understand its range or distribution.

  • Standardization: Calculating the mean (μ) and standard deviation (σ) to center the data.
  • Min-Max Scaling: Identifying the absolute min and max values of a feature.

Why it’s learning: If your test set contains the “true” maximum value and you include it in your scaling calculation, your training data is being squashed or stretched based on information it shouldn’t have.
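A minimal toy sketch of that effect (the numbers are made up purely for illustration): fitting MinMaxScaler on the combined data versus the training data alone produces different scaled values for the very same training points.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature: three training values and one larger test value
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[10.0]])

# Leaky: the scaler's max comes from the test point (10), squashing the training data
leaky = MinMaxScaler().fit(np.vstack([X_train_toy, X_test_toy]))
print(leaky.transform(X_train_toy).ravel())   # ≈ [0.0, 0.111, 0.222]

# Correct: the scaler only knows the training range (1 to 3)
clean = MinMaxScaler().fit(X_train_toy)
print(clean.transform(X_train_toy).ravel())   # [0.0, 0.5, 1.0]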

2. Handling Missing Values (Imputation).

Filling in the blanks requires a best guess based on existing data.

  • Statistical Imputation: Calculating the mean, median, or mode of a column.
  • Model-based Imputation: Using K-Nearest Neighbors (KNN) to predict missing values based on surrounding points.

Why it’s learning: The fill values are statistics of whatever data the imputer was fitted on. If you impute before splitting, the column mean (or the KNN neighbors) is computed over the full dataset, so the values written into your training rows already carry information from the test set — the model has seen test data before it should have.
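A minimal sketch of leakage-free imputation with scikit-learn’s SimpleImputer, assuming a numeric feature matrix X and target y are already loaded:

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the imputer on the training set only, so the fill value
# (here the column median) comes exclusively from training data
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)

# Reuse the training medians to fill gaps in the test set
X_test = imputer.transform(X_test)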

3. Dimensionality Reduction (PCA).

  • Principal Component Analysis (PCA): This algorithm finds the axes of maximum variance.

Why it’s learning: PCA is an unsupervised learning algorithm. It builds a map of the data’s structure. If the test data is used to build that map, the transformed features are already aware of the test set’s spread.
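A minimal sketch of the split-first pattern for PCA, assuming X and y are already loaded (the choice of n_components is arbitrary):

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Learn the principal components from the training data only
pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)

# Project the test set onto the components learned from training data
X_test_reduced = pca.transform(X_test)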

4. Feature Engineering & Selection.

  • Frequency/Target Encoding: Replacing categories with their frequency or the average target value.
  • Correlation Analysis: Selecting features based on how highly they correlate with the target variable.

Why it’s learning: Target encoding is particularly dangerous; it directly injects the answer (the target) into the features.

Target leakage occurs when information derived from the target variable is used during preprocessing before splitting the data. This often happens accidentally during feature selection, correlation analysis, or target encoding, where the model indirectly gains access to the answer it is supposed to predict.
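A minimal sketch of split-aware target encoding, assuming a hypothetical pandas DataFrame df with a categorical “city” column and a “target” column (none of these names come from the article):

import pandas as pd
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()

# Per-category target means computed from the TRAINING rows only
encoding = train_df.groupby("city")["target"].mean()
global_mean = train_df["target"].mean()

# Apply the same training-derived encoding to both splits;
# categories unseen in training fall back to the training target mean
train_df["city_encoded"] = train_df["city"].map(encoding).fillna(global_mean)
test_df["city_encoded"] = test_df["city"].map(encoding).fillna(global_mean)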

5. Resampling (SMOTE).

  • Oversampling: Analyzing the density and boundaries of the minority class to generate synthetic examples.

Why it’s learning: Synthetic points are created based on the neighbors around them. If you oversample before splitting, you might create synthetic points in the training set that are near-duplicates of points that end up in your test set.
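A minimal sketch of the leakage-free ordering, assuming the imbalanced-learn library (not mentioned in the article, but the standard SMOTE implementation) and an already-loaded X and y:

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# 1. Split first, so the test set keeps its real class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 2. Generate synthetic minority samples from the TRAINING data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 3. Train on the resampled data, evaluate on the untouched test set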

If these steps are performed before splitting, the test set influences the training process.

What Happens If You Preprocess Before Splitting?

Consider scaling with StandardScaler.

Suppose the training data has a mean of 5 and the test data has a mean of 10.

If you scale the entire dataset before splitting, the scaler is fitted on the combined statistics (a mean of roughly 7.5 if both parts are the same size). That means your training process already incorporates information from the test set.

The result is:

  • Unrealistically high accuracy
  • Poor real-world generalization
  • Misleading evaluation metrics

❌ The Wrong Approach (Global Scaling)

By fitting the scaler on the whole dataset, the training data “knows” the global range, including the values that will only ever appear in the test set.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()

# LEAKAGE: Scaler learns parameters from the WHOLE dataset
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

✅ The Correct Approach (The Wall Strategy)

You must build a wall between your training and testing data before a single calculation is made.

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(X, y)

# 2. Fit ONLY on training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# 3. Apply (transform) to test using training stats
X_test = scaler.transform(X_test)

Now, the test set remains completely unseen during training.

What Is Safe Before Splitting?

Some preprocessing steps are safe to perform before splitting because they do not learn from the data distribution. These include:

  • Removing irrelevant columns
  • Dropping duplicates
  • Fixing formatting errors
  • Renaming features

These operations do not calculate statistics, extract patterns, or analyze relationships in the data.

The key rule: if a step does not compute statistics from the dataset’s distribution or learn patterns from it, then it is safe to perform before splitting. If the step involves calculating a mean, standard deviation, correlation, frequency, or any learned structure, it must happen after the split.
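A short pandas sketch of this kind of cleanup (the file name and column names are hypothetical):

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file

# Safe before splitting: none of these steps learn from the data distribution
df = df.drop(columns=["row_id"])                    # remove an irrelevant column
df = df.drop_duplicates()                           # drop duplicate rows
df = df.rename(columns={"Avg. Temp": "avg_temp"})   # fix a column name
df["date"] = pd.to_datetime(df["date"])             # fix a formatting issue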

The Correct Machine Learning Workflow

1. Split the raw data into training and test sets

2. Fit every preprocessing step (scaling, imputation, encoding, PCA) on the training set only

3. Apply the fitted transformations to the test set

4. Train the model on the transformed training data

5. Evaluate on the transformed test set

Cross-Validation: A Common Confusion.

When using k-fold cross-validation, preprocessing must happen inside each fold. If preprocessing is done before cross-validation, leakage still occurs. The safest way to handle this is by using a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

scores = cross_val_score(pipeline, X, y, cv=5)

The Pipeline ensures that, within each fold, the scaler is fitted only on the training portion and then applied to the validation portion — automatically.

Real-World Example: Audio Emotion Recognition.

Consider an emotion detection project using audio files. Correct workflow:

1. Split WAV files into train/test sets

2. Extract MFCC features from training data

3. Compute normalization statistics from training features

4. Apply the same normalization to test features

5. Train classifier

6. Evaluate

If you normalize before splitting, statistics from test audio leak into training, invalidating your evaluation.
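A condensed sketch of this workflow, assuming librosa for MFCC extraction and hypothetical wav_paths and labels lists (the article does not prescribe a specific audio library):

import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def extract_mfcc(path, n_mfcc=13):
    # Load one WAV file and average its MFCC frames into a fixed-length vector
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# 1. Split the file list first (wav_paths and labels are hypothetical)
train_paths, test_paths, y_train, y_test = train_test_split(wav_paths, labels, stratify=labels)

# 2–3. Extract features, then compute normalization statistics from TRAINING features only
X_train = np.array([extract_mfcc(p) for p in train_paths])
X_test = np.array([extract_mfcc(p) for p in test_paths])

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# 4. Apply the training statistics to the test features
X_test = scaler.transform(X_test)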

Advanced “Silent” Leakage Killers

The Time-Travel Problem

If your data is a time series, a random train_test_split allows the model to “predict the past” using knowledge from the “future.”

Solution: Use TimeSeriesSplit. Always ensure your training window ends before your validation window begins.
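A minimal sketch, reusing the scaler-plus-model pipeline from the cross-validation section above:

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Each fold trains on an earlier window and validates on the window that follows it
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(pipeline, X, y, cv=tscv)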

The Group Leakage Problem

If your dataset contains multiple rows for the same entity (e.g., five photos of the same patient), a random split might put three photos in training and two in testing. The model isn’t learning disease; it’s memorizing that specific patient.

Solution: Use GroupKFold to ensure all data from one entity stays in the same split.
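A minimal sketch, again reusing the pipeline from above and assuming a hypothetical patient_ids array with one group label per row:

from sklearn.model_selection import GroupKFold, cross_val_score

# All rows sharing a group id (e.g. one patient) land in the same fold
gkf = GroupKFold(n_splits=5)
scores = cross_val_score(pipeline, X, y, groups=patient_ids, cv=gkf)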

Common Beginner Mistakes

  • Scaling before splitting
  • Applying SMOTE before splitting
  • Performing feature selection on the entire dataset
  • Using the target variable during preprocessing
  • Doing PCA before cross-validation

These mistakes may not cause visible errors, but they silently distort results.

The Final Model Audit Checklist.

Before you deploy or publish, run your pipeline through these checks:

  • Pre-Processing .fit() Check: Ensure .fit() is never called on the raw, unsplit dataset.
  • Imputation Isolation: Are missing values being filled based only on the training set’s mean or median?
  • Pipeline Implementation: Is sklearn.pipeline.Pipeline used for cross-validation to prevent leakage within folds automatically?
  • Resampling Check: Is SMOTE or oversampling applied after the split to avoid near-duplicate data leakage?
  • Temporal Logic: If the data is time-sensitive, did you use TimeSeriesSplit instead of a random shuffle?
  • Entity Grouping: For multi-row entities (like patient records), is GroupKFold used to keep entity data together in the same split?

Key Takeaways

  1. Always split before any step that learns from data.
  2. The test set must simulate future unseen data.
  3. Data leakage leads to misleading performance.
  4. Use Pipeline to avoid leakage during cross-validation.
  5. Following the correct workflow ensures reproducibility and research integrity.

Machine learning is not just about achieving high accuracy; it’s about achieving trustworthy and generalizable performance. A small mistake in workflow, such as preprocessing before splitting, can silently invalidate your entire evaluation. Production systems do not forgive leakage. Real-world data will not contain hints from the future. If your model was trained with hidden information from the test set, its performance was never real to begin with. If you’re serious about building production-ready systems or publishing credible research, this principle is non-negotiable. If your workflow is wrong, your accuracy is meaningless.
