What Actually Breaks ML Models in Production: A Fintech Case Study

Real production incidents from fintech classification models — and the engineering fixes that actually worked

Introduction: The Silent Failures Nobody Talks About

Your model just went live. Training metrics looked great — 0.92 AUC, precision and recall perfectly balanced. The deployment pipeline ran without a single error.

Two weeks later, your product manager sends a Slack message: “Why are we rejecting 40% more applications than last month?”

You check the monitoring dashboard. Everything shows green. The model is running. Predictions are being generated. No exceptions logged.

This is what most ML production failures actually look like.

They don’t crash. They don’t throw errors. They just quietly start making worse decisions, and by the time anyone notices, thousands of predictions have already gone wrong.

After working on multiple fintech classification models — credit risk, fraud detection, loan approvals — I’ve seen the same patterns repeat. This article documents the actual production incidents we encountered, why they were invisible to traditional monitoring, and what engineering practices actually prevented them from happening again.

*Figure: What people think breaks ML in production vs. what actually breaks*

These aren’t theoretical failure modes from research papers. These are real problems that cost real money and damaged real user experiences.

Failure #1: Training–Serving Skew (When Your Model Lives in Two Different Worlds)

The Incident

A credit risk model started assigning unexpectedly high risk scores to new users. Approval rates dropped by 15% over three weeks. The strange part? Our offline evaluation metrics hadn’t changed at all.

The model wasn’t broken. The features were.

What Was Really Happening

During training, we computed features using complete historical datasets. In production, we computed them using whatever data was actually available at decision time.

Consider a feature like “average transaction amount over the past 90 days”:

Training environment:

  • Full transaction history available
  • Feature computed from complete 90-day windows
  • Missing data filled using forward-fill or interpolation

Production environment:

  • Real-time feature store with 30-day retention
  • New users had only 5–10 days of history
  • No imputation logic at inference time

The feature had the same name in both environments. It absolutely did not have the same distribution.
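
To make the skew concrete, here is a toy sketch with synthetic numbers (not our production data): the same feature name, computed over a 90-day window offline and a 30-day window online, produces very different values as soon as recent behavior differs from older behavior.

import numpy as np
import pandas as pd

# Synthetic illustration: spending ramps up in the most recent month
rng = np.random.default_rng(42)
dates = pd.date_range(end="2024-03-31", periods=90, freq="D")
txns = pd.DataFrame({
    "transaction_date": dates,
    "amount": np.concatenate([rng.normal(40, 5, 60), rng.normal(90, 5, 30)]),
})

# What training computed: the full 90-day window
avg_90d = txns["amount"].mean()  # ~57

# What serving computed under the same feature name: only 30 days available
cutoff = dates[-1] - pd.Timedelta(days=30)
avg_30d = txns.loc[txns["transaction_date"] >= cutoff, "amount"].mean()  # ~89

print(f"avg_amount_90d offline: {avg_90d:.1f}, online: {avg_30d:.1f}")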

The Architecture Problem

┌─────────────────────────────────────────────────────────────┐
│ TRAINING PIPELINE                                            │
│                                                              │
│ Historical DB  →  Feature Engineering  →  Model Training     │
│ (complete data)   (90-day windows)        (learns patterns)  │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ PRODUCTION PIPELINE                                          │
│                                                              │
│ Feature Store  →  Feature Retrieval  →  Model Inference      │
│ (30-day cache)     (partial windows)    (expects 90d data)   │
└─────────────────────────────────────────────────────────────┘

The model was trained on one distribution and served predictions on a completely different one.

*Figure 1: The training pipeline uses 90-day windows with complete data, while production only has 30-day retention*

Why Traditional Monitoring Missed It

Our monitoring tracked:

  • API latency ✓
  • Error rates ✓
  • Prediction volume ✓
  • Model version ✓

What it didn’t track:

  • Feature distribution shifts
  • Input data availability
  • Computation path differences

The model was technically “working” — it just wasn’t working correctly.

The Code That Caused It

Offline training code:

# Training feature computation
def compute_features_offline(user_transactions_df):
    """
    Uses full historical data from the data warehouse
    """
    features = user_transactions_df.groupby('user_id').agg({
        'amount': ['mean', 'std', 'min', 'max'],
        'transaction_date': 'count'
    }).reset_index()

    # Flatten the MultiIndex columns produced by the aggregation
    features.columns = ['user_id', 'amount_mean', 'amount_std',
                        'amount_min', 'amount_max', 'txn_count']

    # Compute the trailing 90-day average with the full lookback available
    # offline, then keep the most recent value per user
    rolling_mean = (
        user_transactions_df
        .sort_values('transaction_date')
        .set_index('transaction_date')
        .groupby('user_id')['amount']
        .rolling('90D', min_periods=1)
        .mean()
    )
    features['avg_amount_90d'] = (
        rolling_mean.groupby(level='user_id').last()
        .reindex(features['user_id'])
        .values
    )

    return features

Online inference code:

# Production feature computation
def compute_features_online(user_id, feature_store):
    """
    Uses real-time feature store with limited retention
    """
    # Feature store only retains 30 days
    recent_txns = feature_store.get_transactions(
        user_id=user_id,
        days_back=30  # ← This is the problem
    )

    if len(recent_txns) == 0:
        return default_features()

    features = {
        'avg_amount_90d': recent_txns['amount'].mean(),  # Wrong!
        'txn_count': len(recent_txns),
        # ... other features
    }

    return features

Notice the disconnect? The training code expected 90 days. The production code could only provide 30.

What Actually Fixed It

1. Unified Feature Computation

We created a single source of truth for feature definitions:

from datetime import datetime, timedelta

# Shared feature library
class TransactionFeatures:
    """
    Single feature definition used in both training and serving
    """
    LOOKBACK_WINDOW = 30  # Explicitly defined, matches feature store retention

    @staticmethod
    def compute_avg_amount(transactions_df):
        """
        Same logic runs in training and production
        """
        if len(transactions_df) == 0:
            return 0.0

        # Filter to the exact window available in production
        cutoff_date = datetime.now() - timedelta(
            days=TransactionFeatures.LOOKBACK_WINDOW
        )
        recent = transactions_df[
            transactions_df['transaction_date'] >= cutoff_date
        ]

        return recent['amount'].mean() if len(recent) > 0 else 0.0

2. Feature Availability Documentation

We documented what data was actually available at inference time:

FEATURE_DEFINITIONS = {
    'avg_amount_30d': {
        'description': 'Average transaction amount',
        'lookback_days': 30,
        'availability': 'real-time',
        'min_data_points': 3,
        'fallback_value': 0.0
    }
}
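
A minimal sketch of how a definition like this might be enforced at inference time. The helper name and its arguments are illustrative, not part of our actual system:

# Hypothetical enforcement of a feature definition at inference time
def apply_feature_definition(name, raw_value, data_points):
    """
    Fall back to the documented default when too little data is available.
    """
    spec = FEATURE_DEFINITIONS[name]

    if raw_value is None or data_points < spec['min_data_points']:
        return spec['fallback_value']
    return raw_value

# A new user with only 2 transactions gets the documented fallback instead of
# a misleading average computed from a near-empty window
value = apply_feature_definition('avg_amount_30d', raw_value=182.5, data_points=2)
assert value == 0.0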

3. Training-Serving Consistency Tests

We added automated tests that compared feature distributions:

from scipy.stats import ks_2samp

def test_feature_consistency():
    """
    Verify training and serving features match
    """
    # Generate features using the training pipeline
    training_features = compute_features_offline(sample_users)

    # Generate the same features using the production pipeline
    serving_features = [
        compute_features_online(user_id, feature_store)
        for user_id in sample_users
    ]

    # Compare distributions feature by feature
    for feature_name in FEATURE_DEFINITIONS.keys():
        train_dist = training_features[feature_name]
        serve_dist = [f[feature_name] for f in serving_features]

        # Two-sample Kolmogorov-Smirnov test
        ks_stat, p_value = ks_2samp(train_dist, serve_dist)

        assert p_value > 0.05, (
            f"Feature {feature_name} distributions differ: "
            f"KS statistic = {ks_stat}, p-value = {p_value}"
        )

4. Feature Removal Over Feature Engineering

In several cases, we simply removed problematic features:

  • Features requiring data unavailable at inference time → Removed
  • Features with inconsistent computation paths → Removed
  • Features that couldn’t be reliably replicated → Removed

Counterintuitively, removing features improved production stability more than trying to “fix” them with complex imputation strategies.
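
The same idea can be automated: prune any feature whose documented requirements the serving path cannot meet, before training ever starts. A rough sketch against the FEATURE_DEFINITIONS format above, with the serving constraints as assumed constants:

# Assumed serving constraints (illustrative values)
SERVABLE_SOURCES = {'real-time'}
MAX_SERVABLE_LOOKBACK_DAYS = 30  # feature store retention

def select_servable_features(feature_definitions):
    """
    Split features into ones the serving path can reproduce and ones it can't.
    """
    keep, drop = [], []
    for name, spec in feature_definitions.items():
        servable = (
            spec['availability'] in SERVABLE_SOURCES
            and spec['lookback_days'] <= MAX_SERVABLE_LOOKBACK_DAYS
        )
        (keep if servable else drop).append(name)
    return keep, drop

keep, drop = select_servable_features(FEATURE_DEFINITIONS)
print(f"Training on {len(keep)} features; dropped {len(drop)} unservable ones")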

The Real Lesson

Training-serving skew isn’t a model problem. It’s an engineering problem.

The model trained correctly. The deployment succeeded. The infrastructure worked. But the data pipeline assumptions were fundamentally incompatible between environments.

The fix wasn’t better models. It was better data engineering.

Failure #2: Data Drift That Monitoring Couldn’t See

The Incident

A fraud detection model started flagging legitimate transactions as suspicious. False positive rate jumped from 3% to 12% over six weeks. Customer complaints increased proportionally.

Our drift detection showed… nothing unusual.

What Changed

The model’s input distribution hadn’t drifted. The world had drifted.

A major payment processor changed their transaction metadata format. What was previously a categorical field with 50 distinct values suddenly had 200+ values. Our model had never seen most of them during training.

The Architecture Gap

┌──────────────────────────────────────────────────────┐
│ TRAINING DATA (2022-2023)                             │
│                                                       │
│ Transaction Category: ['retail', 'food', 'gas', ...]  │
│ Total unique values: 50                               │
│ Coverage: 99.8% of transactions                       │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ PRODUCTION DATA (March 2024)                          │
│                                                       │
│ Transaction Category: ['retail', 'food', 'gas', ...]  │
│ + 150 NEW values from payment processor update        │
│ Coverage: 60% of transactions match training          │
└──────────────────────────────────────────────────────┘

*Figure 2: A model trained on 50 categories now sees 200+ distinct values after the payment processor update*

Why Standard Drift Detection Failed

Our drift monitoring compared statistical distributions:

from scipy.stats import entropy

# Standard drift detection
def detect_drift(reference_data, production_data, feature_name):
    """
    Compares category distributions using KL divergence
    """
    ref_dist = reference_data[feature_name].value_counts(normalize=True)
    prod_dist = production_data[feature_name].value_counts(normalize=True)

    # Align production frequencies to the reference categories;
    # values never seen in the reference window are silently dropped
    prod_aligned = prod_dist.reindex(ref_dist.index, fill_value=1e-9)
    prod_aligned = prod_aligned / prod_aligned.sum()

    kl_div = entropy(ref_dist, prod_aligned)

    return kl_div > DRIFT_THRESHOLD

The problem? The comparison only covers categories that exist in the reference data. The 150 new values never entered the calculation, so the known categories still looked statistically similar even though the feature's real-world meaning had completely changed.

What Actually Fixed It

1. Vocabulary Monitoring

Track which values the model has actually seen:

import logging

class VocabularyMonitor:
    def __init__(self, training_vocab):
        self.known_values = set(training_vocab)

    def check_production_value(self, value):
        """
        Flag unknown categorical values
        """
        if value not in self.known_values:
            logging.warning(
                f"Unknown categorical value encountered: {value}"
            )
            return False
        return True

    def get_coverage(self, production_data):
        """
        Calculate the fraction of production values seen during training
        """
        total = len(production_data)
        known = sum(
            val in self.known_values
            for val in production_data
        )
        return known / total if total > 0 else 1.0

2. Semantic Drift Detection

Instead of just comparing distributions, we monitored coverage:

def monitor_categorical_coverage(feature_name, production_batch):
    """
    Track how many production values were seen during training
    """
    training_values = get_training_vocabulary(feature_name)

    production_values = set(production_batch[feature_name].unique())

    coverage = len(
        production_values & training_values
    ) / len(production_values)

    if coverage < 0.8:  # Alert threshold
        alert(
            f"Only {coverage*100:.1f}% of production "
            f"values were seen during training"
        )

3. Graceful Unknown Handling

Default handling for unseen categorical values:

def encode_categorical_safe(value, known_encodings):
    """
    Handle unknown categorical values gracefully
    """
    if value in known_encodings:
        return known_encodings[value]
    else:
        # Map to the 'unknown' category instead of crashing
        return known_encodings.get('__UNKNOWN__', 0)
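
For completeness, here is how known_encodings might be built from the training vocabulary, with index 0 reserved for anything the model has never seen. The vocabulary values are just placeholders:

# Build encodings from the training vocabulary, reserving 0 for unknowns
training_vocab = ['retail', 'food', 'gas']  # placeholder values
known_encodings = {'__UNKNOWN__': 0}
known_encodings.update({cat: i + 1 for i, cat in enumerate(training_vocab)})

encode_categorical_safe('food', known_encodings)             # -> 2
encode_categorical_safe('crypto_exchange', known_encodings)  # -> 0 (unknown bucket)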

The Real Lesson

Drift detection isn’t just about distribution statistics. It’s about tracking what your model has actually learned.

For categorical features, monitoring vocabulary coverage matters more than monitoring statistical moments.

Failure #3: Label Leakage Discovered After Deployment

The Incident

A loan default prediction model achieved 0.96 AUC during training. After deployment, it was essentially useless — barely better than random guessing.

We had label leakage. And it only became obvious in production.

What Leaked

One of our features was “days since last payment.” During training, we computed this using the full transaction history — including transactions that occurred after the loan default.

In production, we could only compute it using data available before the prediction.

The Subtle Bug

Training code:

# This runs on historical data
def compute_days_since_last_payment(user_id, reference_date):
    """
    Computes using ALL historical data
    """
    all_payments = get_all_payments(user_id)  # Includes future!

    if len(all_payments) == 0:
        return 999  # Large default value

    # No cutoff applied: the "last payment" can be one that happened
    # after the default we are trying to predict
    last_payment = all_payments['date'].max()

    return (reference_date - last_payment).days

The problem isn’t obvious at a glance: get_all_payments() returns the complete payment history, including payments made after the default date, and no cutoff is applied before taking the most recent payment.

*Figure 3: Training feature accidentally used payment data from after the decision date*

What should have been:

def compute_days_since_last_payment(user_id, reference_date):
    """
    Computes using only data available at reference_date
    """
    # Only get payments BEFORE the reference date
    payments = get_payments_before_date(user_id, reference_date)

    if len(payments) == 0:
        return 999

    last_payment = payments['date'].max()

    return (reference_date - last_payment).days

Why This Went Undetected

Cross-validation looked perfect because the leakage was consistent across train/validation splits. The model learned a pattern that was only valid when future data was available.
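
One check that would have caught this before launch is recomputing a sample of training features with a strict temporal cutoff and diffing the two versions. A rough sketch, reusing the (leaky) compute_days_since_last_payment above and the hypothetical get_payments_before_date helper:

# Audit a sampled feature: does the training value match what a strict
# point-in-time computation would have produced?
def audit_payment_recency(user_id, decision_date):
    training_value = compute_days_since_last_payment(user_id, decision_date)

    payments = get_payments_before_date(user_id, decision_date)
    clean_value = (
        999 if len(payments) == 0
        else (decision_date - payments['date'].max()).days
    )

    if training_value != clean_value:
        print(f"user {user_id}: training saw {training_value}, "
              f"production would see {clean_value}")
    return training_value == clean_value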

What Actually Fixed It

1. Point-in-Time Feature Engineering

We enforced strict temporal boundaries:

class PointInTimeFeatureStore:
    """
    Ensures features only use data available at decision time
    """
    def get_features(self, user_id, as_of_date):
        """
        as_of_date: The decision timestamp
        """
        # Hard constraint: no data on or after as_of_date
        # (use bound parameters rather than f-strings in real code)
        historical_data = self.db.query(
            f"""
            SELECT * FROM transactions
            WHERE user_id = {user_id}
              AND transaction_date < '{as_of_date}'
            """
        )

        return self.compute_features(historical_data)

2. Production-First Feature Development

We reversed the development process:

  1. First: Write the production feature computation code
  2. Second: Replay it on historical data for training
  3. Third: Validate that both paths produce identical results

import pandas as pd

# Define production logic first
def production_feature_logic(data_before_decision):
    return {
        'days_since_last_payment': compute_payment_recency(data_before_decision),
        'total_transactions': len(data_before_decision),
        # ... other features
    }

# Use the same logic for training
def create_training_dataset(historical_users, decision_dates):
    """
    Replay production logic on historical data
    """
    features = []

    for user_id, decision_date in zip(historical_users, decision_dates):
        # Get data exactly as production would see it
        data_available = get_data_before_date(user_id, decision_date)

        # Use the exact same feature computation
        user_features = production_feature_logic(data_available)

        features.append(user_features)

    return pd.DataFrame(features)

The Real Lesson

Label leakage isn’t just a data science problem. It’s a time-travel problem.

If your training features use any information that wouldn’t be available at prediction time, your model is learning to predict the past, not the future.

What Actually Prevents These Failures: The Engineering Practices That Worked

After fixing these incidents multiple times, a clear pattern emerged. The solutions weren’t about better algorithms or more sophisticated monitoring. They were about engineering discipline.

Practice 1: Feature Contracts

Treat features like API contracts:

# feature_definitions.yaml
features:
  avg_transaction_amount_30d:
    description: "Average transaction amount over last 30 days"
    data_source: "transactions table"
    lookback_window: 30
    minimum_data_points: 3
    fallback_value: 0.0
    computation:
      training: "features.offline.compute_avg_amount()"
      serving: "features.online.compute_avg_amount()"
    validation:
      - name: "distribution_match"
        threshold: 0.05
      - name: "null_rate"
        threshold: 0.01
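
The contract only helps if something enforces it. A small sketch of how the null-rate clause might be checked in CI, assuming the YAML above and PyYAML; the function name and structure are illustrative:

import yaml

with open('feature_definitions.yaml') as f:
    CONTRACTS = yaml.safe_load(f)['features']

def check_null_rate(feature_name, serving_values):
    """
    Fail loudly if the serving null rate exceeds the contracted threshold.
    """
    spec = CONTRACTS[feature_name]
    threshold = next(
        rule['threshold'] for rule in spec['validation']
        if rule['name'] == 'null_rate'
    )

    null_rate = sum(v is None for v in serving_values) / len(serving_values)
    assert null_rate <= threshold, (
        f"{feature_name}: null rate {null_rate:.2%} breaks the contract "
        f"(threshold {threshold:.2%})"
    )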

Practice 2: Unified Feature Repositories

One codebase for both training and serving:

feature_repo/
├── __init__.py
├── base.py                  # Abstract feature interface
├── transaction.py           # Transaction features
├── user.py                  # User features
└── tests/
    ├── test_consistency.py
    └── test_distributions.py

*Figure 4: Single feature repository ensures consistency between training and serving pipelines*
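
One way base.py could look, with every feature implementing its computation exactly once so the training replay and the serving path call the same method. The class names here are illustrative:

from abc import ABC, abstractmethod
import pandas as pd

class Feature(ABC):
    """
    A feature computes its value from whatever history is available
    at decision time, in training replay and in serving alike.
    """
    name: str
    fallback_value: float = 0.0

    @abstractmethod
    def compute(self, history: pd.DataFrame) -> float:
        """history contains only rows available at decision time."""

class AvgTransactionAmount30d(Feature):
    name = 'avg_transaction_amount_30d'

    def compute(self, history: pd.DataFrame) -> float:
        if history.empty:
            return self.fallback_value
        return float(history['amount'].mean())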

Practice 3: Production-First Development

Step 1: Design feature for production constraints

Step 2: Implement production feature computation

Step 3: Replay production logic on historical data

Step 4: Train model using replayed features

Step 5: Validate training/serving consistency

Practice 4: Continuous Validation

# Runs every hour in production
def validate_production_features():
    """
    Automated checks for training-serving consistency
    """
    # Sample recent production requests
    prod_samples = sample_recent_predictions(n=1000)

    # Recompute using training logic
    training_features = compute_offline_features(prod_samples)

    # Compare
    for feature in FEATURE_LIST:
        prod_values = prod_samples[feature]
        train_values = training_features[feature]

        # Statistical comparison
        assert_distributions_match(prod_values, train_values)

        # Null rate comparison
        assert_null_rates_match(prod_values, train_values)
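
assert_distributions_match and assert_null_rates_match are left undefined above; plausible implementations, using the same KS test as the consistency test from Failure #1, might look like this:

import pandas as pd
from scipy.stats import ks_2samp

def assert_distributions_match(prod_values, train_values, alpha=0.05):
    """Alert when the two samples are unlikely to share a distribution."""
    ks_stat, p_value = ks_2samp(prod_values, train_values)
    assert p_value > alpha, (
        f"Distributions diverged: KS = {ks_stat:.3f}, p-value = {p_value:.3f}"
    )

def assert_null_rates_match(prod_values, train_values, tolerance=0.01):
    """Alert when null rates drift apart between serving and training."""
    prod_nulls = pd.isna(pd.Series(prod_values)).mean()
    train_nulls = pd.isna(pd.Series(train_values)).mean()
    assert abs(prod_nulls - train_nulls) <= tolerance, (
        f"Null rates differ: serving {prod_nulls:.2%} vs training {train_nulls:.2%}"
    )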

Conclusion: Production ML Is an Engineering Problem

The biggest lesson from these failures is simple: most ML production issues aren’t ML issues.

They’re engineering issues that happen to involve ML models:

  • Training-serving skew is a data pipeline problem
  • Data drift is a monitoring problem
  • Label leakage is a temporal consistency problem

The models themselves were fine. The math was correct. The algorithms worked. What failed was the infrastructure around them.

If you’re building production ML systems, invest more time in:

  • Feature engineering discipline
  • Training-serving consistency
  • Temporal correctness
  • Automated validation

And less time in:

  • Chasing marginal accuracy improvements
  • Complex model architectures
  • Hyperparameter tuning

The model that deploys reliably is better than the model that performs slightly better in offline evaluation.

References and Further Reading

  • Sculley et al. (2015) — “Hidden Technical Debt in Machine Learning Systems”
  • Google’s “Rules of Machine Learning” — #5: Test the infrastructure independently from the model
  • Feast Feature Store Documentation — Consistency Guarantees

Thanks for reading! If you’ve encountered similar production failures or have questions about these fixes, I’d love to hear from you in the comments.

