In this lesson, you will learn how to make ML systems reliable, correct, and production-ready through structured testing and validation. You will walk through unit tests, integration tests, load and performance checks, fixtures, code quality tools, and automated test runs, giving you everything you need to ensure your ML API behaves predictably under real-world conditions.

This lesson is the last of a 2-part series on Software Engineering for Machine Learning Operations (MLOps):

FastAPI for MLOps: Python Project Structure and API Best Practices
Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing (this tutorial)

To learn how to test, validate, and stress-test your ML services like a professional MLOps engineer, just keep reading.

Introduction to MLOps Testing: Building Reliable ML Systems with Pytest

Testing is the backbone of reliable MLOps. A model might look great in a notebook, but once wrapped in services, APIs, configs, and infrastructure, dozens of things can break silently: incorrect inputs, unexpected model outputs, missing environment variables, slow endpoints, and downstream failures. This lesson ensures you never ship those problems into production.

In this lesson, you will learn the complete testing workflow for machine learning (ML) systems: from small, isolated unit tests to full API integration checks and load testing your endpoints under real traffic conditions. You will also understand how to structure your tests, how each type of test fits into the MLOps lifecycle, and how to design a test suite that grows cleanly as your project evolves.

To learn how to validate, benchmark, and harden your ML applications for production, just keep reading.

Why Testing Is Non-Negotiable in MLOps

Machine learning adds layers of unpredictability on top of regular software engineering. Models drift, inputs vary, inference latency can increase, and small code changes can ripple into major behavioral shifts. Without testing, you have no safety net. Proper tests make your system observable, predictable, and safe to deploy.

What You Will Learn: Pytest, Fixtures, and Load Testing for MLOps

You will walk through a practical testing workflow tailored for ML applications: writing unit tests for inference logic, validating API endpoints end-to-end, using fixtures to isolate environments, verifying configuration behavior, and running load tests to understand real-world performance. Each example connects directly to the codebase you built earlier.

From FastAPI to Testing: Extending Your MLOps Pipeline with Validation

Previously, you learned how to structure a clean ML codebase, configure environments, separate services, and expose reliable API endpoints. Now, you will stress-test that foundation. This lesson transforms your structured application into a validated, production-ready system with tests that catch issues before users ever see them.

Test-Driven MLOps: Applying Software Testing Best Practices to ML Pipelines

Test-driven development (TDD) matters even more in ML because models introduce uncertainty on top of normal software complexity. A single mistake in preprocessing, an incorrect model version, or a slow endpoint can break your application in ways that are hard to detect without a structured testing strategy. Test-driven MLOps gives you a predictable workflow: write tests, run them often, and let failures guide improvements.

What to Test in MLOps Pipelines: Models, APIs, and Configurations

ML systems require testing across multiple layers because issues can appear anywhere: in preprocessing logic, service code, configuration loading, API endpoints, or the model itself. You should verify that your inference service behaves correctly with both valid and invalid inputs, that your API returns consistent responses, that your configuration behaves as expected, and that the entire pipeline works end-to-end. Even when using a dummy model, testing ensures that the structure of your system remains correct as the real model is swapped in later.

Unit vs Integration vs Performance Testing

Unit tests focus on the smallest pieces of your system: functions, helper modules, and the inference service. They run fast and break quickly when a small change introduces an error. Integration tests validate how components work together: routes, services, configs, and the FastAPI layer. They ensure your API behaves consistently no matter what changes inside the codebase. Performance tests simulate real user traffic, evaluating latency, throughput, and failure rates under load. Together, these 3 types of tests create full confidence in your ML application.

The Software Testing Pyramid for MLOps: Unit, Integration, and Load Testing

The testing pyramid helps prioritize effort: many unit tests at the bottom, fewer integration tests in the middle, and a small number of heavy performance tests at the top. ML systems especially benefit from this structure because most failures occur in smaller utilities and service functions, not in the final API layer. By weighting your test suite correctly, you get fast feedback during development while still validating the entire system before deployment.

Project Structure and Test Layout

A clean testing layout makes your ML system predictable, scalable, and easy to maintain. By separating tests into clear categories (e.g., unit, integration, and performance), you ensure that each kind of test has a focused purpose and a natural home inside the repository. This structure also mirrors how real production MLOps teams organize their work, making your project easier to extend as your system grows.

Test Directory Structure for MLOps: unit, integration, and performance

Your Lesson 2 repository includes a dedicated tests/ directory with 3 subfolders:

tests/
│── unit/
│── integration/
└── performance/

unit/: holds small, fast tests that validate individual pieces such as the DummyModel, the inference service, or helper functions.
integration/: contains tests that spin up the FastAPI app and verify endpoints like /health, /predict, and the OpenAPI docs.
performance/: includes Locust load testing scripts that simulate real traffic hitting your API to measure latency, throughput, and error rates.

This layout ensures that each type of test is separated by intent and runtime cost, giving you a clean way to scale your test suite over time.

Understanding Pytest Fixtures: Using conftest.py for Reusable Test Setup

The conftest.py file is the backbone of your testing environment. Pytest automatically loads fixtures defined here and makes them available across all test files without explicit imports.

Your project uses conftest.py to provide:

FastAPI TestClient fixture: allows integration tests to call your API exactly the way a real HTTP client would.
Sample input data: keeps repeated values out of your test files.
Expected outputs: help tests stay focused on behavior rather than setup.

This shared setup reduces duplication, keeps tests clean, and ensures consistent test behavior across the entire suite.
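
As a rough illustration of the last two items, a conftest.py might expose sample inputs and expected labels through a fixture. The fixture name sample_cases and its values are assumptions made for this sketch; they are not the repository's actual fixtures.

```python
import pytest

# Hypothetical (input_text, expected_label) pairs; not taken from the repo.
SAMPLE_CASES = [
    ("This is good", "positive"),
    ("This is great", "positive"),
    ("This is bad", "negative"),
]


@pytest.fixture
def sample_cases():
    """Provide a fresh copy of the shared cases to each test."""
    return list(SAMPLE_CASES)


def test_sample_cases_use_known_labels(sample_cases):
    # pytest injects the fixture by matching the argument name.
    for _text, label in sample_cases:
        assert label in {"positive", "negative"}
```

Returning a copy keeps one test from mutating data that another test relies on.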

Where to Place Tests in MLOps Projects: Unit vs Integration vs Performance

A simple rule-of-thumb keeps your test organization disciplined:

Put tests in unit/ when the code under test does not require a running API or external system.
Example: testing that the DummyModel.predict() returns “positive” for the word great.
Put tests in integration/ when the test needs the full FastAPI app running.
Example: calling /predict and checking that the API returns a JSON response.
Put tests in performance/ when measuring speed, concurrency limits, or error behavior under load.
Example: Locust scripts simulating dozens of users sending /predict requests at once.

Following this pattern ensures your tests remain stable, fast, and easy to reason about as the project grows.

Unit Testing in MLOps with Pytest

Unit tests are your first safety net in MLOps. Before you hit the API, spin up Locust, or ship to production, you want to know: Does my core prediction code behave exactly the way I think it does?

In this lesson, you do that by testing 2 things in isolation:

inference service: services/inference_service.py
dummy model: models/dummy_model.py

All of that is captured in tests/unit/test_inference_service.py.

The Code Under Test: Inference Service and Dummy Model

First, recall what you are testing.

services/inference_service.py

"""
Simple inference service for making model predictions.
"""
from models.dummy_model import DummyModel
from core.logger import logger

# Initialize model
model = DummyModel()
logger.info(f"Loaded model: {model.model_name}")


def predict(input_text: str) -> str:
    """
    Make a prediction using the loaded model.
   
    Args:
        input_text: Input text for prediction
       
    Returns:
        Prediction result as string
    """
    logger.info(f"Making prediction for input: {input_text[:50]}...")
   
    try:
        prediction = model.predict(input_text)
        logger.info(f"Prediction result: {prediction}")
        return prediction
    except Exception as e:
        logger.error(f"Error during prediction: {str(e)}")
        raise

This file does 3 things:

Initializes a DummyModel once at import time and logs that it loaded.
Exposes a predict(input_text: str) -> str function that:
- Logs the incoming input (truncated to 50 chars).
- Calls model.predict(...).
- Logs and returns the prediction.
Catches any exception, logs the error, and re-raises it so failures are visible.

You are not testing FastAPI here, just pure Python logic: given some text, does this function consistently return the correct label?
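
The error path (log, then re-raise) can also be exercised without FastAPI. The sketch below is a self-contained approximation: FailingModel, the logger name, and the two-argument predict() are stand-ins invented here so the example runs on its own, whereas the real service uses a module-level model.

```python
import logging

logger = logging.getLogger("inference_sketch")  # hypothetical logger name


class FailingModel:
    """Hypothetical stand-in that always fails, to trigger the error path."""

    def predict(self, input_data):
        raise RuntimeError("model exploded")


def predict(input_text, model):
    """Mirror the service's try/except: log the error, then re-raise it."""
    try:
        return model.predict(input_text)
    except Exception as exc:
        logger.error("Error during prediction: %s", exc)
        raise


def check_error_is_logged_and_reraised():
    """Run predict() with a failing model; return the error and log messages."""
    records = []
    handler = logging.Handler()
    handler.emit = records.append  # collect log records in a list
    logger.addHandler(handler)
    caught = None
    try:
        predict("anything", FailingModel())
    except RuntimeError as exc:
        caught = str(exc)
    finally:
        logger.removeHandler(handler)
    return caught, [record.getMessage() for record in records]
```

In a pytest suite, the same idea is usually expressed with pytest.raises plus the built-in monkeypatch and caplog fixtures, replacing the module-level model for the duration of one test.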

models/dummy_model.py

"""
Placeholder dummy model class.
"""
from typing import Any


class DummyModel:
    """
    A placeholder ML model that labels text with a simple keyword rule.
    """
   
    def __init__(self) -> None:
        """Initialize the dummy model."""
        self.model_name = "dummy_classifier"
        self.version = "1.0.0"
   
    def predict(self, input_data: Any) -> str:
        """
        Make a keyword-based prediction (a stand-in for real inference).
       
        Args:
            input_data: Input data for prediction
           
        Returns:
            "positive" if the text contains "good" or "great", else "negative"
        """
        text = str(input_data).lower()
        if "good" in text or "great" in text:
            return "positive"
        return "negative"

This model is deliberately simple:

The constructor sets model_name and version for logging and version tracking.
The predict() method:
- Converts any input to lowercase text.
- Returns "positive" if it sees "good" or "great" in the text.
- Returns "negative" otherwise.

Your unit tests will assert that both the service and model behave exactly like this.
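
Before writing assertions, it helps to see what the keyword rule implies at the edges. The sketch below copies the predict() logic into a standalone class (so it runs without the project installed) and tries a few inputs chosen for illustration. Note that substring matching makes "goodbye" and "not good at all" positive.

```python
class DummyModel:
    """Standalone copy of the keyword rule, for illustration only."""

    def predict(self, input_data):
        text = str(input_data).lower()
        if "good" in text or "great" in text:
            return "positive"
        return "negative"


model = DummyModel()

# Illustrative inputs (not from the repo's test suite).
results = {
    "GREAT job": model.predict("GREAT job"),            # case-insensitive match
    "goodbye": model.predict("goodbye"),                # substring match
    "not good at all": model.predict("not good at all"),  # negation is ignored
    "": model.predict(""),                              # empty input
    "12345": model.predict(12345),                      # non-string input
}

for text, label in results.items():
    print(f"{text!r:20} -> {label}")
```

Cases like these are worth pinning down in unit tests so that the behavior is a documented decision, not an accident.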

Writing Pytest Unit Tests for MLOps: test_inference_service.py

Here is the full unit test module:

"""
Unit tests for the inference service.
"""
import pytest
from services.inference_service import predict
from models.dummy_model import DummyModel


class TestInferenceService:
    """Test class for inference service."""
   
    def test_predict_returns_string(self):
        """Test that predict() returns a string."""
        result = predict("some input text")
        assert isinstance(result, str)
   
    def test_predict_positive_input(self):
        """Test prediction with positive input."""
        result = predict("This is good")
        assert result == "positive"
   
    def test_predict_negative_input(self):
        """Test prediction with negative input."""
        result = predict("This is bad")
        assert result == "negative"


class TestDummyModel:
    """Test class for DummyModel."""
   
    def test_model_initialization(self):
        """Test that the model initializes correctly."""
        model = DummyModel()
        assert model.model_name == "dummy_classifier"
        assert model.version == "1.0.0"
   
    def test_predict_with_good_word(self):
        """Test that the model returns positive for 'good'."""
        model = DummyModel()
        result = model.predict("This is good")
        assert result == "positive"
   
    def test_predict_with_great_word(self):
        """Test that the model returns positive for 'great'."""
        model = DummyModel()
        result = model.predict("This is great")
        assert result == "positive"
   
    def test_predict_without_keywords(self):
        """Test that the model returns negative without keywords."""
        model = DummyModel()
        test_inputs = ["test", "random text", "negative sentiment"]
        for input_text in test_inputs:
            result = model.predict(input_text)
            assert result == "negative"

Let us break it down.

Testing the Inference Service with Pytest (MLOps Unit Tests)

The first test class focuses on the service function, not the API:

class TestInferenceService:
    """Test class for inference service."""
   
    def test_predict_returns_string(self):
        """Test that predict() returns a string."""
        result = predict("some input text")
        assert isinstance(result, str)

This test checks that predict() returns a string for a typical text input.
If someone later changes predict() to return a dict, tuple, or Pydantic model, this test will fail immediately.

    def test_predict_positive_input(self):
        """Test prediction with positive input."""
        result = predict("This is good")
        assert result == "positive"
   
    def test_predict_negative_input(self):
        """Test prediction with negative input."""
        result = predict("This is bad")
        assert result == "negative"

These 2 tests verify the happy-path behavior:

Text containing "good" should be classified as "positive".
Text without "good" or "great" should default to "negative".

Notice what’s not happening here:

No FastAPI client.
No HTTP calls.
No environment or config loading.

This is pure, fast, deterministic testing of the core service logic.

Testing ML Models in Isolation with Pytest

The second test class targets the model directly:

class TestDummyModel:
    """Test class for DummyModel."""
   
    def test_model_initialization(self):
        """Test that the model initializes correctly."""
        model = DummyModel()
        assert model.model_name == "dummy_classifier"
        assert model.version == "1.0.0"

This verifies that your model is initialized correctly.
In real projects, this might include loading weights, setting up devices, or configuration. Here, it is just model_name and version, but the pattern is the same.

    def test_predict_with_good_word(self):
        """Test that the model returns positive for 'good'."""
        model = DummyModel()
        result = model.predict("This is good")
        assert result == "positive"
   
    def test_predict_with_great_word(self):
        """Test that the model returns positive for 'great'."""
        model = DummyModel()
        result = model.predict("This is great")
        assert result == "positive"

These tests assert that the keyword-based classification logic works: both "good" and "great" map to "positive".

    def test_predict_without_keywords(self):
        """Test that the model returns negative without keywords."""
        model = DummyModel()
        test_inputs = ["test", "random text", "negative sentiment"]
        for input_text in test_inputs:
            result = model.predict(input_text)
            assert result == "negative"

This test loops over several neutral and negative phrases to make sure the model consistently returns “negative” when no positive keywords are present.
This is your guardrail against accidental changes to the keyword logic.
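
One possible refinement, shown here as a sketch and not as the repository's code, is to express the loop with pytest.mark.parametrize so that each input is reported as its own test case. The classify() helper is a stand-in for DummyModel.predict(), copied so the example runs on its own.

```python
import pytest


def classify(text):
    """Stand-in for DummyModel.predict (same keyword rule)."""
    lowered = str(text).lower()
    if "good" in lowered or "great" in lowered:
        return "positive"
    return "negative"


@pytest.mark.parametrize(
    "input_text",
    ["test", "random text", "negative sentiment"],
)
def test_predict_without_keywords(input_text):
    # Each parameter becomes a separate test ID in pytest's report.
    assert classify(input_text) == "negative"
```

With the loop version, the first failing input stops the test; with parametrization, the remaining inputs still run and are reported individually.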

How to Run Pytest Unit Tests for MLOps Projects

To run just these tests:

pytest tests/unit/ -v

Or with Poetry:

poetry run pytest tests/unit/ -v

You will see output similar to:

tests/unit/test_inference_service.py::TestInferenceService::test_predict_returns_string PASSED
tests/unit/test_inference_service.py::TestInferenceService::test_predict_positive_input PASSED
tests/unit/test_inference_service.py::TestInferenceService::test_predict_negative_input PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_model_initialization PASSED
...

When everything is green, you know:

Your core prediction logic is stable.
The dummy model behaves exactly as designed.
You can now safely move on to integration tests and performance tests in later sections.

Integration Testing in MLOps

Unit tests validate your core Python logic, but integration tests answer a different question:

“Does the entire application behave correctly when all components work together?”

This means testing:

FastAPI app
routing layer
service functions
model
configuration loaded at runtime

All of this happens using FastAPI’s TestClient and your actual running application object (app from main.py).

Let’s break it down.

Using FastAPI TestClient for Integration Testing with Pytest

Your conftest.py defines a reusable client fixture:

import pytest
from fastapi.testclient import TestClient
from main import app

@pytest.fixture
def client():
    """Create a test client for the FastAPI app."""
    return TestClient(app)

How FastAPI TestClient Works for API Testing

TestClient(app) calls your FastAPI application in-process.
No server is launched and no network traffic leaves the process.
Because the fixture is function-scoped by default, every test receives a fresh client that exposes the same request interface as a regular HTTP client.

This lets you write code such as:

response = client.get("/health")

as if you were calling a real deployed API, but entirely offline and deterministic.

Testing API Endpoints (/health, /predict)

Here is the integration test code from your repo:

class TestHealthEndpoint:
    def test_health_check_returns_ok(self, client):
        response = client.get("/health")

        assert response.status_code == 200
        assert response.json() == {"status": "ok"}
   
    def test_health_check_has_correct_content_type(self, client):
        response = client.get("/health")

        assert response.status_code == 200
        assert "application/json" in response.headers["content-type"]

What Integration Tests Verify in an MLOps API

Your /health route is reachable.
It always returns a 200 response.
It returns valid JSON.
The content type is correct.

Here is the real FastAPI code being tested (main.py):

@app.get("/health")
async def health_check():
    logger.info("Health check requested")
    return {"status": "ok"}

The assertions mirror the handler exactly: a 200 status and the JSON body {"status": "ok"}.

Testing the /predict Endpoint in an MLOps API

Your integration tests call the prediction endpoint:

class TestPredictEndpoint:

    def test_predict_endpoint(self, client):
        response = client.post("/predict", params={"input": "good movie"})
        assert response.status_code == 200
        assert "prediction" in response.json()
   
    def test_predict_positive(self, client):
        response = client.post("/predict", params={"input": "This is a great movie!"})
        assert response.status_code == 200
        assert response.json()["prediction"] == "positive"
   
    def test_predict_negative(self, client):
        response = client.post("/predict", params={"input": "This is bad"})
        assert response.status_code == 200
        assert response.json()["prediction"] == "negative"

This tests:

The endpoint exists and accepts POST requests.
The parameter is correctly passed using params={"input": ...}.
The internal inference logic (service → model) behaves correctly end-to-end.

Here is the actual API endpoint in your main.py:

@app.post("/predict")
async def predict_route(input: str):
    return {"prediction": predict_service(input)}

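The tests follow the same contract as the route: a POST request with an input query parameter and a JSON response containing a prediction key.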

Testing Documentation Endpoints (/docs, /openapi.json)

FastAPI generates these endpoints by default, and they are worth testing whenever other teams or tools rely on your API schema.

Your tests:

class TestAPIDocumentation:
    def test_openapi_schema_accessible(self, client):
        response = client.get("/openapi.json")

        assert response.status_code == 200
        schema = response.json()
        assert "openapi" in schema
        assert "info" in schema
   
    def test_swagger_ui_accessible(self, client):
        response = client.get("/docs")

        assert response.status_code == 200
        assert "text/html" in response.headers["content-type"]

What This Ensures

The OpenAPI schema is generated.
Swagger UI loads successfully.
No misconfiguration broke the docs.
Consumers (frontend teams, other ML services, monitoring) can introspect your API.

This is standard for production ML systems.

Testing Error Handling in FastAPI APIs with Pytest

Your code includes error tests that verify robustness:

class TestErrorHandling:
    def test_nonexistent_endpoint_returns_404(self, client):
        response = client.get("/nonexistent")
        assert response.status_code == 404
   
    def test_invalid_method_on_health_endpoint(self, client):
        response = client.post("/health")
        assert response.status_code == 405  # Method Not Allowed
   
    def test_malformed_requests_handled_gracefully(self, client):
        # Note: this sends a well-formed GET, so it only confirms /health still responds.
        response = client.get("/health")
        assert response.status_code == 200

Integration Test Breakdown: What Each Test Validates

**Table 1:** Key API edge case tests and their importance in ensuring system reliability

These tests ensure your service behaves consistently even when clients behave incorrectly.

How to Run Integration Tests with Pytest in MLOps

To run only the integration tests:

Using pytest directly

pytest tests/integration/ -v

With Poetry

poetry run pytest tests/integration/ -v

With Makefile

make test-integration

You will see output like:

tests/integration/test_api_routes.py::TestHealthEndpoint::test_health_check_returns_ok PASSED
tests/integration/test_api_routes.py::TestPredictEndpoint::test_predict_positive PASSED
tests/integration/test_api_routes.py::TestAPIDocumentation::test_swagger_ui_accessible PASSED
...

Green = your API works correctly end-to-end.

Performance and Load Testing with Locust

Performance testing is critical for ML systems because even a lightweight model can become slow, unstable, or unresponsive when many users hit the API at once. With Locust, you can simulate hundreds or thousands of concurrent users calling your ML inference endpoints and measure how your API behaves under pressure.

This section explains why load testing matters, how Locust works, how your actual test file is structured, and how to interpret its results.

Why Load Testing Is Essential for MLOps and ML APIs

ML inference services have unique scaling behaviors:

Model loading can require significant memory.
Inference latency can grow non-linearly as concurrency increases.
CPU/GPU bottlenecks often show up only when multiple users hit the system.
Thread starvation can cause cascading failures.
Autoscaling decisions depend on real-world load patterns.

A service that performs well for one user may fail miserably at 50 users.

Load testing ensures:

The API stays responsive under traffic.
Latency stays under acceptable thresholds.
No unexpected failures or timeouts occur.
You understand the system’s scaling limits before going to production.

Locust is well suited to this because it is lightweight, Python-based, and designed for web APIs.

Locust Load Testing Concepts: Users, Spawn Rate, and Tasks Explained

Locust simulates user behavior using simple Python classes.

Users

A “user” is an independent client that continuously makes requests to your API.

Example:

10 users = 10 active clients repeatedly calling /predict.

Spawn rate

How quickly Locust ramps up users.

Example:

spawn rate 2 = add 2 users per second until target is reached.

This helps simulate realistic traffic spikes instead of instantly launching all users.

Tasks

Each simulated user executes a set of tasks (e.g., repeatedly calling the /predict endpoint).

Every task can have a weight:

Higher weight = more frequent calls.

This lets you mimic real user patterns like:

90% predict calls
10% health checks

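Your project's locustfile takes a simpler approach: it defines a single weighted /predict task and calls /health once per user at startup.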
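
The weight and spawn-rate arithmetic is easy to check by hand. The sketch below assumes a hypothetical two-task mix with weights 9 and 1 (for example, @task(9) and @task(1)) and an idealized constant spawn rate. It computes each task's expected share of calls and the time needed to reach the target user count.

```python
def task_shares(weights):
    """Return each task's expected share of calls for Locust-style weights."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def ramp_up_seconds(target_users, spawn_rate):
    """Time to reach the target user count at a constant spawn rate."""
    return target_users / spawn_rate


# Hypothetical task mix: 9 parts /predict to 1 part /health.
shares = task_shares({"predict": 9, "health": 1})

# 10 users added at 2 users per second.
ramp = ramp_up_seconds(target_users=10, spawn_rate=2)

print(shares)  # expected share per task
print(ramp)    # seconds until all users are running
```

Because tasks are picked at random in proportion to their weights, the observed mix only approaches these shares over many requests.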

Writing the locustfile.py

from locust import HttpUser, task, between

class MLAPIUser(HttpUser):
    """
    Locust user class for testing the ML API.
   
    Simulates a user making requests to the API endpoints.
    """
   
    # Wait between 1 and 3 seconds between requests
    wait_time = between(1, 3)
   
    @task(10)
    def test_predict(self):
        """
        Test the predict endpoint.
       
        This task has weight 10. It is the only task defined, so every iteration runs it.
        """
        payload = {"input": "The movie was good"}
        with self.client.post("/predict", params=payload, catch_response=True) as response:
            if response.status_code == 200:
                response_data = response.json()
                if "prediction" in response_data:
                    response.success()
                else:
                    response.failure(f"Missing prediction in response: {response_data}")
            else:
                response.failure(f"HTTP {response.status_code}")
   
    def on_start(self):
        """
        Called when a user starts testing.
       
        Used for setup tasks like authentication.
        """
        # Verify the API is reachable
        response = self.client.get("/health")
        if response.status_code != 200:
            print(f"Warning: API health check failed with status {response.status_code}")

What This Locust Load Test Validates in an MLOps API

Creates a simulated user (MLAPIUser) that calls /predict.
Gives the /predict task a weight of 10. Because it is the only task, the weight has no effect yet; it only matters once other tasks are added.
Sends realistic input (“The movie was good”).
Validates:
- Response code is 200.
- JSON contains “prediction”.
Marks failures explicitly for clean reporting.
On startup, each user verifies that /health works.

This matches your API:

/predict is POST with query parameter input=...
/health is GET and returns {"status": "ok"}

The load test therefore exercises the same request shapes as your integration tests.

Running Locust: Headless Mode vs Web UI Dashboard

Locust supports two modes.

A. Web UI Mode (Interactive Dashboard)

Launch Locust:

locust -f tests/performance/locustfile.py --host=http://localhost:8000

Then open:

http://localhost:8089

You will see a dashboard where you can:

Set number of users
Set spawn rate
Start/stop tests
View real-time stats

B. Headless Mode (Automated CI/CD or scripting)

You already have a script:

software-engineering-mlops-lesson2/scripts/run_locust.sh

Run:

./scripts/run_locust.sh http://localhost:8000 10 2 5m

This executes:

10 users
spawn rate 2 users per second
run time 5 minutes
save HTML report

No UI; perfect for pipelines.
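
The contents of run_locust.sh are not shown here, so the following is only a sketch of what an equivalent headless invocation could look like, built as an argument list in Python. The function name build_locust_command is hypothetical. The flags are standard Locust command-line options, and the report path follows the timestamped pattern used for the HTML reports.

```python
from datetime import datetime


def build_locust_command(host, users, spawn_rate, run_time, now=None):
    """Build a headless Locust command with a timestamped HTML report path."""
    now = now or datetime.now()
    report = f"reports/locust_reports/locust_report_{now:%Y%m%d_%H%M%S}.html"
    return [
        "locust",
        "-f", "tests/performance/locustfile.py",
        "--host", host,
        "--headless",                     # run without the web UI
        "--users", str(users),            # target number of simulated users
        "--spawn-rate", str(spawn_rate),  # users started per second
        "--run-time", run_time,           # stop after this duration, e.g. "5m"
        "--html", report,                 # write an HTML report at the end
    ]


cmd = build_locust_command(
    "http://localhost:8000", 10, 2, "5m",
    now=datetime(2025, 10, 30, 3, 13, 31),  # fixed time for a reproducible path
)
print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) from a CI job once the reports directory exists.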

Generating Locust Load Testing Reports for ML APIs

Your script uses:

--html="reports/locust_reports/locust_report_<timestamp>.html"

Which produces files like:

reports/locust_reports/locust_report_20251030_031331.html

Each report includes:

Requests per second (RPS)
Failure stats
Full latency distribution
Percentiles (50th, 95th, 99th)
Charts of active users and response times

These HTML reports are great for:

Comparing deployments
Regression testing API performance
Flagging slow model versions
Archiving performance history

The script in your repo already writes these reports, so no extra setup is needed.

Understanding Test Metrics (RPS, failures, latency, P95/P99)

Locust gives several performance metrics you must understand for ML systems.

Requests per Second (RPS)

How many inference calls your API can handle per second.

CPU-bound models lead to low RPS
Simple models lead to high RPS

Increasing the number of users shows where your model and server saturate.
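
A quick back-of-the-envelope estimate helps you sanity-check the RPS that Locust reports. The sketch below assumes a closed-loop test in which each user waits, sends one request, and waits again, so one cycle takes the mean wait time plus the mean latency. The 40 ms latency is an assumed figure for illustration, and estimated_rps is a hypothetical helper.

```python
def estimated_rps(users, mean_wait_s, mean_latency_s):
    """Approximate steady-state throughput for a closed-loop load test."""
    return users / (mean_wait_s + mean_latency_s)


# wait_time = between(1, 3) draws uniformly, so the mean wait is 2 seconds.
mean_wait = (1 + 3) / 2

# 10 users, assumed mean latency of 40 ms.
rps = estimated_rps(users=10, mean_wait_s=mean_wait, mean_latency_s=0.040)
print(f"{rps:.2f} requests/second")
```

With these settings, the wait time dominates the cycle, so 10 users generate only about 5 requests per second. To find the saturation point you need more users or a shorter wait time.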

Failures

Locust marks a request as failed when:

Status code ≠ 200
Response JSON does not contain “prediction”
Timeout occurs
Server returns an internal error

Your catch_response=True logic handles this explicitly.

This prevents “hidden” failures.

Latency (ms)

Response time per request, typically measured in milliseconds.

For interactive ML APIs, latency is usually the metric that users notice first.

You will see:

Average latency
Median (P50)
Slowest (max latency)

P95 / P99 (Tail Latency)

The 95th and 99th percentile response times.

These capture tail behavior: how slow the slowest 5% and 1% of requests are, as opposed to the single worst case.

Example:

P50 = 40 ms
P95 = 210 ms
P99 = 540 ms

This means:

Most users see fast responses, but a small % experience major slowdowns.

This is common in ML workloads due to:

Model warmup
Thread contention
Python GIL contention
Model cache misses

Production SLOs usually track P95 and P99, not averages.
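
To make the percentile figures concrete, the sketch below computes them with the nearest-rank method on a made-up latency sample (90 fast requests, 8 slow, 2 very slow). The data is invented for illustration, and Locust's own percentile calculation may differ in detail.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds.
latencies_ms = [40] * 90 + [210] * 8 + [540] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"P50={p50} ms  P95={p95} ms  P99={p99} ms  mean={mean:.1f} ms")
```

The mean (63.6 ms) sits close to the median and gives no hint that 1 in 50 requests takes over half a second, which is why tail percentiles make better SLO targets.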

MLOps Test Configuration: YAML and Environment Variables

ML systems behave differently across production, development, and testing environments.

Your Lesson 2 codebase separates these environments cleanly using:

A test-specific YAML config
A modified BaseSettings loader
.env overrides for test mode

This ensures that tests run quickly, deterministically, and without polluting real environment settings.

Let’s break down how this works.

Understanding test_config.yaml for MLOps Testing

# Test Configuration
environment: "test"
log_level: "DEBUG"

# API Configuration
api_host: "127.0.0.1"
api_port: 8000
debug: true

# Performance Testing
performance:
  baseline_users: 10
  spawn_rate: 2
  test_duration: "5m"

# Model Configuration
model:
  name: "dummy_classifier"
  version: "1.0.0"

What test_config.yaml Controls in MLOps Pipelines

**Table 2:** Configuration keys and their roles in test environment setup

This config prevents tests from accidentally picking up production configs.

Overriding Application Configuration in Test Mode

Your test environment uses a special configuration loader inside:

core/config.py

Here is the real code:

def load_config() -> Settings:
    # Load base settings from environment
    settings = Settings()
   
    # Load additional configuration from YAML if it exists
    config_path = "configs/test_config.yaml"
    if os.path.exists(config_path):
        yaml_config = load_yaml_config(config_path)
       
        # Override settings with YAML values if they exist
        for key, value in yaml_config.items():
            if hasattr(settings, key):
                setattr(settings, key, value)
   
    return settings

How Configuration Overrides Work: YAML and Environment Variables

Step 1: BaseSettings loads environment variables
(.env, operating system (OS) variables, defaults)
Step 2: YAML configuration overrides them
test_config.yaml replaces any matching fields in Settings.
Final output:
The application is now in test mode, completely isolated from development and production environments.
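
The override loop can be tried in isolation. The sketch below is a simplified stand-in: Settings is a plain dataclass with assumed default values, not the project's BaseSettings class, and the YAML file is written out as the dictionary it would parse to. It also records which keys are skipped, since hasattr() ignores any key that Settings does not declare.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical stand-in for the project's Settings class."""

    environment: str = "development"
    log_level: str = "INFO"
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    debug: bool = False


def apply_overrides(settings, overrides):
    """Copy matching keys onto settings and return the keys that were skipped."""
    skipped = []
    for key, value in overrides.items():
        if hasattr(settings, key):
            setattr(settings, key, value)
        else:
            skipped.append(key)
    return skipped


# test_config.yaml expressed as the dict a YAML parser would produce.
yaml_config = {
    "environment": "test",
    "log_level": "DEBUG",
    "api_host": "127.0.0.1",
    "api_port": 8000,
    "debug": True,
    "performance": {"baseline_users": 10, "spawn_rate": 2, "test_duration": "5m"},
    "model": {"name": "dummy_classifier", "version": "1.0.0"},
}

settings = Settings()
skipped = apply_overrides(settings, yaml_config)
print(settings)
print("skipped:", skipped)
```

In this stand-in, the nested performance and model sections are skipped because the dataclass has no matching fields. Whether the real Settings class declares them depends on the project code.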

Why Configuration Management Matters in MLOps Testing

Integration tests always use the same port, host, and log settings.
Tests are repeatable and deterministic.
You never accidentally load production API keys or endpoints.
CI/CD pipelines get consistent behavior.

This pattern is very common in real-world MLOps systems.

Using Environment Variables for Test Isolation

Your test environment uses a .env.example file:

# API Configuration
API_PORT=8000
API_HOST=0.0.0.0
DEBUG=true

# Environment
ENVIRONMENT=test

# Logging
LOG_LEVEL=DEBUG

During setup, users run:

cp .env.example .env

This creates the .env used during tests.

Why test-specific .env variables matter

**Table 3:** Environment variables and their impact on test execution

Combined with YAML overrides:

.env → applies defaults

test_config.yaml → overrides final values

This gives you a flexible and safe configuration stack.
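
One way to keep test-specific variables from leaking into the rest of a test session is to patch the environment only for the duration of a test. The sketch below uses `unittest.mock.patch.dict` from the standard library; the `read_api_settings` helper is hypothetical and only mirrors the variable names from .env.example.

```python
import os
from unittest import mock


def read_api_settings() -> dict:
    """Hypothetical helper that reads the variables listed in .env.example."""
    return {
        "host": os.environ.get("API_HOST", "127.0.0.1"),
        "port": int(os.environ.get("API_PORT", "8000")),
        "environment": os.environ.get("ENVIRONMENT", "development"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }


def settings_in_test_mode() -> dict:
    """Read settings with test values applied, then restore the environment."""
    test_env = {
        "API_HOST": "0.0.0.0",
        "API_PORT": "8000",
        "ENVIRONMENT": "test",
        "LOG_LEVEL": "DEBUG",
    }
    # patch.dict restores os.environ to its previous state when the block exits
    with mock.patch.dict(os.environ, test_env):
        return read_api_settings()


environment_before = os.environ.get("ENVIRONMENT")
test_settings = settings_in_test_mode()
environment_after = os.environ.get("ENVIRONMENT")
print(test_settings["environment"], environment_before == environment_after)
```

Inside pytest, the built-in `monkeypatch` fixture (`monkeypatch.setenv`) serves the same purpose.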

Code Quality in MLOps: Linting, Formatting, and Static Analysis Tools

Testing ensures correctness, but code quality tools ensure that your ML system remains maintainable as it grows.

In Lesson 2, you introduce a full suite of professional-quality tooling:

flake8 for linting
Black for auto-formatting
isort for import ordering
MyPy for static typing
Makefile automation for consistency

Together, they enforce the same engineering discipline used on real production ML teams at scale.

Linting Python Code with flake8

Linting catches code smells, stylistic issues, and subtle bugs before they hit production.

Your repository includes a real .flake8 file:

[flake8]
max-line-length = 88
extend-ignore = E203, W503
exclude =
    .git,
    __pycache__,
    .venv,
    venv,
    env,
    build,
    dist,
    *.egg-info,
    .pytest_cache,
    .mypy_cache
per-file-ignores =
    __init__.py:F401
max-complexity = 10

What your flake8 setup enforces:

88-character line limit (matches Black)
Ignores E203 and W503, two style checks that conflict with Black's output
Avoids checking generated or virtual-env directories
Allows unused imports only in __init__.py files
Enforces a maximum complexity score of 10

Run flake8 manually:

poetry run flake8 .

Or via Makefile:

make lint

Linting becomes part of your day-to-day workflow and prevents style drift across your ML services.
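
To see why E203 and W503 appear in the ignore list, it helps to look at code in the shape Black produces. The two functions below are illustrative only (their names and logic are not from the repository). The slice has spaces around the colon, which is the pattern E203 flags, and the long boolean expression is broken before each `and`, which is the pattern W503 flags.

```python
def window(values, start, offset, size):
    # Black puts spaces around ":" when slice bounds are complex expressions;
    # flake8's E203 check (whitespace before ':') flags this style
    return values[start + offset : start + offset + size]


def is_valid_text(text, minimum_length, maximum_length):
    # Black breaks long expressions before binary operators;
    # flake8's W503 check (line break before binary operator) flags this style
    return (
        isinstance(text, str)
        and len(text.strip()) >= minimum_length
        and len(text.strip()) <= maximum_length
    )


print(window([0, 1, 2, 3, 4, 5], 1, 1, 3))
print(is_valid_text("  great model  ", 3, 20))
```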

Formatting Python Code with Black

Black is an automatic code formatter; it rewrites Python code into a consistent style.

Your Lesson 2 pyproject.toml includes:

[tool.black]
line-length = 88
target-version = ['py39']
include = '\.pyi?$'

This means:

All Python source files (.py) and stub files (.pyi) are formatted.
Max line length is 88 characters.
Formatted output is kept compatible with Python 3.9.

Format all code:

poetry run black .

Or using the Makefile shortcut:

make format

Black removes tedious decisions about spacing, commas, and line breaks, ensuring all contributors share the same style.

Using isort to Manage Python Imports

isort automatically manages import sorting and grouping.

Your pyproject.toml contains:

[tool.isort]
profile = "black"
multi_line_output = 3

This aligns isort’s output with Black’s formatting rules, avoiding conflicts.

How to Run isort for Clean Python Imports

poetry run isort .

Or via Makefile:

make format

Why This Matters

As ML services grow, import lists become messy. isort keeps them clean and consistent, which makes modules easier to scan and diffs easier to review.

Static Type Checking with MyPy for MLOps Codebases

Static typing is increasingly important in MLOps systems, especially when passing models, configs, and data structures between services.

Your repo contains a full mypy.ini:

[mypy]
python_version = 3.9
warn_return_any = True
warn_unused_configs = True
disallow_untyped_defs = False
ignore_missing_imports = True

[mypy-tests.*]
disallow_untyped_defs = False

[mypy-locust.*]
ignore_missing_imports = True

What This Config Enforces

Warns when a function with a declared return type returns a value typed as Any
Warns about unused config sections
Does not require type hints everywhere (reasonable for ML codebases)
Silences errors for imported packages that ship without type information (common in ML pipelines)
Allows untyped defs in tests
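
As a sketch of what `warn_return_any` catches, consider parsing a JSON payload. The function names below are hypothetical and not part of the repository. `json.loads` is typed as returning `Any`, so handing its result straight back from a function declared to return `str` produces a MyPy warning, while an explicit `isinstance` check lets MyPy verify the return type.

```python
import json
from typing import Any


def parse_label_loose(payload: str) -> str:
    data = json.loads(payload)  # json.loads is typed as returning Any
    # With warn_return_any = True, mypy warns on this line: an Any value
    # is returned from a function declared to return str
    return data["label"]


def parse_label(payload: str) -> str:
    data: Any = json.loads(payload)
    label = data["label"]
    # The isinstance check narrows Any to str, so mypy accepts the return
    if not isinstance(label, str):
        raise TypeError(f"expected a string label, got {type(label).__name__}")
    return label


print(parse_label('{"label": "positive"}'))
```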

Run MyPy

poetry run mypy .

Or via Makefile:

make type-check

Why MyPy Is Critical in ML Systems

Prevents silent type errors (e.g., passing a list where a tensor is expected)
Catches config mistakes before runtime
Improves refactor safety for large ML codebases

Using a Makefile to Automate MLOps Testing and Code Quality

Your Makefile automates all key development tasks:

make test          # Run all tests
make test-unit     # Unit tests only
make test-integration
make format        # Black + isort
make lint          # flake8
make type-check    # mypy
make load-test     # Locust performance tests
make clean         # Reset environment

This ensures:

Every developer uses the same commands
CI/CD pipelines can call the same interface
Tooling stays consistent across machines

Example workflow for contributors:

make format
make lint
make type-check
make test

If all commands pass, you know your code is clean, consistent, and ready for production.

Automating Testing with a Pytest Test Runner Script

As your ML system grows, running dozens of unit, integration, and performance tests manually becomes tedious and error-prone.

Lesson 2 includes a fully automated test runner (scripts/run_tests.sh) that enforces a predictable, repeatable workflow for your entire test suite.

This script acts like a miniature CI pipeline that you can run locally. It prints structured logs, enforces failure conditions, and ensures that no test is accidentally skipped.

Running Automated Tests with run_tests.sh

Your repository includes a fully functional test runner:

#!/bin/bash

# Test Runner Script for MLOps Lesson 2

set -e

echo "🧪 Running MLOps Lesson 2 Tests..."

# Colors for output
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'

print_status() {
    echo -e "${GREEN}✅ $1${NC}"
}

print_warning() {
    echo -e "${YELLOW}⚠️  $1${NC}"
}

print_error() {
    echo -e "${RED}❌ $1${NC}"
}

# Run unit tests
echo ""
echo "📝 Running unit tests..."
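# Test the command directly: with set -e, a bare failing command would
# exit the script before a separate $? check could run
if poetry run pytest tests/unit/ -v; then
    print_status "Unit tests passed"
else
    print_error "Unit tests failed"
    exit 1
fi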

# Run integration tests
echo ""
echo "🔗 Running integration tests..."
# Test the command directly so the error branch still runs under set -e
if poetry run pytest tests/integration/ -v; then
    print_status "Integration tests passed"
else
    print_error "Integration tests failed"
    exit 1
fi

echo ""
print_status "All tests completed successfully!"

How to Run It

./scripts/run_tests.sh

or, via Makefile:

make test

What It Does

Runs unit tests
Runs integration tests
Stops immediately (set -e) if anything fails
Prints colored output for clarity
Provides a clear pass/fail summary

This mirrors real CI pipelines where a failing test stops deployment.

Understanding Pytest Output and Test Results

When you run the script, you will typically see output like this:

🧪 Running MLOps Lesson 2 Tests...

📝 Running unit tests...
============================= test session starts ==============================
collected 7 items

tests/unit/test_inference_service.py::TestInferenceService::test_predict_returns_string PASSED
tests/unit/test_inference_service.py::TestInferenceService::test_predict_positive_input PASSED
tests/unit/test_inference_service.py::TestInferenceService::test_predict_negative_input PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_model_initialization PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_predict_with_good_word PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_predict_with_great_word PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_predict_without_keywords PASSED

============================== 7 passed in 0.45s ===============================
✅ Unit tests passed

Then integration tests:

🔗 Running integration tests...

tests/integration/test_api_routes.py::TestHealthEndpoint::test_health_check_returns_ok PASSED
tests/integration/test_api_routes.py::TestPredictEndpoint::test_predict_positive PASSED
tests/integration/test_api_routes.py::TestAPIDocumentation::test_swagger_ui_accessible PASSED
tests/integration/test_api_routes.py::TestErrorHandling::test_nonexistent_endpoint_returns_404 PASSED

============================== 8 passed in 0.78s ===============================
✅ Integration tests passed

Finally:

✅ All tests completed successfully!

Why Automated Testing Workflows Matter in MLOps

You see exactly which tests failed.
You immediately know whether the API is healthy.
You build the habit of treating tests as a gatekeeper before shipping ML code.

This is foundational MLOps workflow discipline.

Integrating Pytest into CI/CD Pipelines

Your test runner is already written as if it were part of CI.

Very soon, you will plug this into:

GitHub Actions
GitLab CI
CircleCI
AWS CodeBuild
Azure DevOps

A typical GitHub Actions step would look like:

- name: Run Tests
  run: ./scripts/run_tests.sh

Since your script exits with non-zero status on failures, the CI job fails automatically.

What this enables in production ML workflows:

No pull request gets merged unless tests pass
Deployments are blocked if integration tests fail
Load testing can be added as a gated step
Test failures provide early feedback on regressions
Teams enforce consistent standards across developers

You already have everything CI needs:

A deterministic test runner
A strict exit-on-fail system
Separate unit and integration test layers
Makefile wrappers for automation
Poetry ensuring repeatable environments

Once you introduce CI/CD in later lessons, these scripts plug in seamlessly.

Automating Load Testing in MLOps with Locust Scripts

Performance testing becomes essential once an ML API starts supporting real traffic. You want confidence that your inference service will not collapse under load, that p95/p99 latencies remain acceptable, and that the system behaves predictably when scaling horizontally.

Manually running Locust is fine for experimentation, but production MLOps requires automated, repeatable load tests. Lesson 2 provides a dedicated script (run_locust.sh) which allows you to run performance tests in a single line and automatically generate HTML reports for analysis.

Running Automated Locust Load Tests with run_locust.sh

#!/bin/bash

# Simple Locust Load Testing Script for MLOps Lesson 2

set -e

echo "🚀 Starting Locust Load Testing..."

# Configuration
HOST=${1:-"http://localhost:8000"}
USERS=${2:-10}
SPAWN_RATE=${3:-2}
RUN_TIME=${4:-"5m"}

echo "🔧 Configuration: $USERS users, spawn rate $SPAWN_RATE, run time $RUN_TIME"

# Create reports directory
mkdir -p reports/locust_reports

# Check if the API is running
echo "🏥 Checking if API is running..."
if ! curl -s "$HOST/health" > /dev/null; then
    echo "❌ API is not reachable at $HOST"
    echo "Please start the API server first with: python main.py"
    exit 1
fi

echo "✅ API is reachable"

# Run Locust load test
echo "🧪 Starting load test..."

TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
HTML_REPORT="reports/locust_reports/locust_report_$TIMESTAMP.html"

poetry run locust \
    -f tests/performance/locustfile.py \
    --host="$HOST" \
    --users="$USERS" \
    --spawn-rate="$SPAWN_RATE" \
    --run-time="$RUN_TIME" \
    --html="$HTML_REPORT" \
    --headless

echo "✅ Load test completed!"
echo "📊 Report: $HTML_REPORT"

How to Run It

Basic load test:

./scripts/run_locust.sh

This uses the defaults: 10 users, a spawn rate of 2 users per second, and a 5-minute run against http://localhost:8000.

Custom parameters:

./scripts/run_locust.sh http://localhost:8000 30 5 2m

This means:

30 users total
5 users per second spawn rate
2-minute runtime
Tests /predict endpoint repeatedly (because of locustfile.py)

What This Script Automates

API health check before running
Creates the reports directory and a timestamped report file
Runs Locust in headless mode
Stores HTML reports for analysis
Fails gracefully when API is unreachable

This gives you a push-button reproducible performance test, a key requirement in professional MLOps.

Automatically Generating Load Testing Reports for ML APIs

Every run creates a unique HTML report:

reports/locust_reports/
    locust_report_20251203_031331.html
    locust_report_20251203_041215.html
    ...

This file includes:

Requests per second (RPS)
Response time percentiles (p50, p90, p95, p99)
Failure rates
Total requests
Charts for concurrency vs performance
Per-endpoint performance metrics

You can open the report in your browser:

open reports/locust_reports/locust_report_20251203_031331.html

(Windows)

start reports\locust_reports\locust_report_XXXX.html

Why This Is Important

Performance regressions are one of the most common ML service failures:

model upgrades slow down inference unintentionally
logging overhead increases latency
new preprocessing increases CPU usage
hardware changes alter throughput

By keeping each test run stored, you can compare historical performance.

This is the foundation of automatic performance regression detection.

Preparing Load Testing for CI/CD and Cloud MLOps Pipelines

Your load testing script is already CI-ready.

Here is how it fits into a production MLOps pipeline.

Option 1 — GitHub Actions

- name: Run Load Tests
  run: ./scripts/run_locust.sh http://localhost:8000 20 5 1m

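The script exits non-zero when the API is unreachable, and Locust itself exits non-zero by default when any request fails, so this becomes a gated step:

Deployment is blocked if requests fail under the expected load.
Latency limits (such as a p95 ceiling) need an explicit threshold check on top of the exit code.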

Option 2 — Nightly Performance Jobs

Teams often run Locust nightly to catch degradations early:

baseline: 20 users
alert if p95 > 300 ms
alert if failures > 1%

Reports are archived automatically via your script.
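
The nightly thresholds above can be turned into an automated check. The sketch below assumes the Locust command is extended with the `--csv` option so that a stats file is written alongside the HTML report. The column names (`Name`, `Request Count`, `Failure Count`, `95%`) and the `Aggregated` summary row follow Locust's CSV stats output, but verify them against the file your Locust version produces. The sample rows are illustrative values, not measured results.

```python
import csv
import io

P95_LIMIT_MS = 300.0  # alert if p95 > 300 ms
FAILURE_RATE_LIMIT = 0.01  # alert if failures > 1%


def check_thresholds(stats_csv_text: str) -> list:
    """Return a list of threshold violations found in the Aggregated row."""
    violations = []
    for row in csv.DictReader(io.StringIO(stats_csv_text)):
        if row["Name"] != "Aggregated":
            continue
        requests = int(row["Request Count"])
        failures = int(row["Failure Count"])
        p95_ms = float(row["95%"])
        # Treat a run with no requests as a full failure
        failure_rate = failures / requests if requests else 1.0
        if p95_ms > P95_LIMIT_MS:
            violations.append(f"p95 {p95_ms:.0f} ms exceeds {P95_LIMIT_MS:.0f} ms")
        if failure_rate > FAILURE_RATE_LIMIT:
            violations.append(f"failure rate {failure_rate:.1%} exceeds 1%")
    return violations


# Illustrative rows only, trimmed to the columns the check reads
HEALTHY_RUN = """Type,Name,Request Count,Failure Count,95%
POST,/predict,1000,5,180
,Aggregated,1000,5,180
"""

DEGRADED_RUN = """Type,Name,Request Count,Failure Count,95%
POST,/predict,1000,30,420
,Aggregated,1000,30,420
"""

print(check_thresholds(HEALTHY_RUN))
print(check_thresholds(DEGRADED_RUN))
```

A nightly job could exit with a non-zero status whenever the returned list is non-empty, which turns the latency and failure limits into an alert or a gate.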

Option 3 — Cloud Load Testing (AWS/GCP/Azure)

Your script can run inside:

AWS CodeBuild
Azure Pipelines
Google Cloud Build

Simply modify the host:

./scripts/run_locust.sh https://staging.mycompany.com/api 50 10 10m

Why CI Load Tests Matter

Prevents slow releases from being deployed
Ensures model swaps do not tank performance
Protects SLAs (Service Level Agreements)
Helps capacity planning and autoscaling decisions
Detects bottlenecks before customers do

Your repository already contains everything needed to industrialize performance testing.

Test Coverage in MLOps: Measuring and Improving Code Coverage

Even with strong unit, integration, and performance testing, you still need a way to quantify how much of your codebase is actually exercised. This is where test coverage comes in. Coverage tools show you which lines are tested, which are skipped, and where hidden bugs may still be lurking. This is especially important in ML systems, where subtle code paths (error handling, preprocessing, retry logic) can easily be missed.

Your Lesson 2 environment includes pytest-cov, allowing you to generate detailed coverage reports in a single command.

Using pytest-cov to Measure Test Coverage

Coverage is enabled simply by adding --cov flags to pytest.

Basic usage:

pytest --cov=.

Your repo’s pyproject.toml installs pytest-cov automatically under [tool.poetry.group.dev.dependencies], so coverage works out of the box.

A more detailed command:

pytest --cov=. --cov-report=term-missing

This reports:

total coverage percentage
statement and miss counts for each file
line numbers of the statements that were never executed (the Missing column)

Example output you might see (shown here without the Missing column):

---------- coverage: platform linux, python 3.9 ----------
Name                                Stmts   Miss  Cover
--------------------------------------------------------
services/inference_service.py          22      0   100%
models/dummy_model.py                  16      0   100%
core/config.py                         40      8    80%
core/logger.py                         15      0   100%
tests/unit/test_inference_service.py   28      0   100%
--------------------------------------------------------
TOTAL                                 121      8    93%

This gives immediate visibility into which modules need more test attention.
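
The Cover column is derived from the other two. As a quick check of the figures above, this small helper (a hypothetical name, written only for illustration) recomputes the percentage from the statement and miss counts.

```python
def coverage_percent(statements: int, missed: int) -> float:
    """Share of statements that were executed, as a percentage."""
    return 100.0 * (statements - missed) / statements


# Figures taken from the example table above
print(round(coverage_percent(40, 8)))   # core/config.py -> 80
print(round(coverage_percent(121, 8)))  # TOTAL -> 93
```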

How to Measure Code Coverage in MLOps Projects

To formally measure coverage for Lesson 2, run:

pytest -v --cov=. --cov-report=html

This generates a full HTML report inside:

htmlcov/index.html

Open it in your browser:

open htmlcov/index.html

(Windows)

start htmlcov\index.html

The HTML report visualizes:

executed vs missed lines
branch coverage (when you also pass --cov-branch)
per-module summaries
clickable source code with line highlighting

HTML coverage reports like this are commonly published as build artifacts in CI pipelines.

Integrating Coverage into Your Workflow

Your Makefile could easily support it:

make coverage

But even without that, pytest-cov gives you everything you need to evaluate test completeness.

How to Increase Test Coverage in MLOps Pipelines

ML systems often have unusual testing challenges:

multiple code paths depending on data
dynamic model loading
error cases that only appear in production
preprocessing/postprocessing steps
branching logic based on config values
retry and timeout logic
logging behavior that might hide bugs

To increase coverage meaningfully:

1. Test failure modes

Example: model not loaded, invalid input, exceptions in service layer.

2. Test alternative branches

For example, your dummy model has:

if "good" in text or "great" in text:
    return "positive"
return "negative"

Coverage increases when you test:

positive branch
fallback branch
edge cases like empty strings

3. Test configuration-dependent behavior

Since your system loads from:

.env
YAML
runtime values

Try testing scenarios where each layer overrides the one before it.

4. Test logging paths

Logging is crucial in MLOps, and ensuring logs appear where expected also contributes to coverage.

5. Test the API under different payloads

Missing parameters, malformed types, unexpected values.

6. Test integration between modules

Even simple ML systems can break across module boundaries, so tests that exercise interactions between modules cover code paths that isolated unit tests miss.
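
As an example of item 2, the tests below exercise both branches of the keyword rule plus the empty-string edge case. `DummyModel` here is a minimal stand-in that only reproduces the snippet shown above; the class in the repository may differ, so treat the names as assumptions. The functions use plain `assert` statements, so pytest can collect them as-is.

```python
class DummyModel:
    """Minimal stand-in reproducing only the keyword rule shown above."""

    def predict(self, text: str) -> str:
        if "good" in text or "great" in text:
            return "positive"
        return "negative"


def test_positive_branch():
    model = DummyModel()
    assert model.predict("this is good") == "positive"
    assert model.predict("a great result") == "positive"


def test_fallback_branch():
    assert DummyModel().predict("nothing special here") == "negative"


def test_empty_string_edge_case():
    # No keyword can match an empty string, so the fallback branch runs
    assert DummyModel().predict("") == "negative"
```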

Recommended Test Coverage Targets for MLOps Systems

High coverage is good, but perfection is unrealistic and unnecessary.

Here are industry-grade ML-specific targets:

**Table 4:** Recommended test coverage ranges across system components

Why You Do Not Aim for 100%

ML models are often treated as black boxes
Some branches (especially failure conditions) are difficult to simulate
Performance code paths are not always practical to test

A strong MLOps system targets:

Overall coverage: 80-90%

This ensures the most important logic is covered while avoiding diminishing returns.

Critical paths: 100%

Inference, preprocessing, conversion, routing, safety checks.

Performance-sensitive code: covered via load tests

This is why Locust complements pytest rather than replacing it.


Summary

In this lesson, you learned how to make ML systems safe, correct, and production-ready through a full testing and validation workflow. You started by understanding why ML services need far more than “just unit tests,” and how a layered approach (unit, integration, and performance tests) creates confidence in both the code and the behavior of the system. You then explored a real test layout with dedicated folders, fixtures, and isolation, and saw how each type of test validates a different piece of the pipeline.

From there, you implemented unit tests for the inference service and dummy model, followed by integration tests that exercise real FastAPI endpoints, documentation routes, and error handling. You also learned how to perform load testing with Locust, simulate concurrent users, generate performance reports, and interpret latency and failure metrics. This is an essential skill for production ML APIs.

Finally, you covered the tools that keep an ML codebase clean and maintainable: linting, formatting, static typing, and the Makefile commands that tie everything together. You closed with automated test runners, load-test scripts, and coverage reporting, giving you an end-to-end workflow that mirrors real MLOps engineering practice.

By now, you have seen how professional ML systems are tested, validated, measured, and maintained. This sets you up for the next module, where we will begin building data pipelines and reproducible ML workflows.

Citation Information

Singh, V. “Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/4ztdu

@incollection{Singh_2026_pytest-tutorial-mlops-testing-fixtures-locust-load-testing,
  author = {Vikram Singh},
  title = {{Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/4ztdu},
}

