Your AI Agent Backend Will Break in Production

A 3-level testing pyramid for SaaS teams shipping AI features that actually need to work.

We broke our first AI agent backend in production on a Tuesday.

Not because the model gave a wrong answer. Because the system around it wasn’t testable. A prompt change we considered minor cascaded through three tool calls, bypassed a guardrail, and returned corrupted output to a live user. No test caught it. No trace explained it.

That’s the moment we stopped treating our agent as a special case and started treating it like any other critical backend service.

If you’re a CPO or CTO at a SaaS company shipping AI features to customers, this is the pattern we wish we’d had from day one.

What an AI Agent Testing Pyramid Actually Is

An AI agent testing pyramid is a layered testing strategy that separates deterministic backend logic from non-deterministic model outputs. It has three levels: unit and contract tests for routing, state, and tool handlers; integration tests that drive the orchestrator with fake model outputs; and scenario replays that re-run recorded real conversations against new code or prompts.

The goal is to make everything around the model testable and predictable, even when the model itself is not. For ISVs embedding AI analytics into their products, this distinction matters enormously: your customers don’t tolerate flaky behavior, regardless of whether it comes from a model or a broken routing rule.

At Toucan, we build an embedded AI analytics platform for ISVs and SaaS product teams: white-label, customer-facing analytics delivered as a web component or React component, with multi-tenant row-level security and conversational AI. Each user question triggers a chain of tool calls across our semantic layer, metric library, and data sources, and one request can spawn multiple sub-agent hops before returning a visualization. “Run it and see” isn’t a testing strategy at that scale. It’s a liability.

Why AI Agent Testing Is Not Like Testing Normal Code

Traditional backend services follow deterministic code paths. AI agents don’t.

Model outputs are non-deterministic: the same prompt can yield different answers. Behavior is emergent: small prompt changes can have large downstream effects. Workflows are long-lived: one user message may trigger multiple tool calls and sub-agent hops.

If you rely on end-to-end “does it answer correctly?” checks, you end up with flaky tests tied to a specific model provider and version, very slow feedback loops, and no clear signal about where a failure actually originates.

According to the 2024 State of AI Engineering survey by the AI Infrastructure Alliance, 68% of engineering teams shipping LLM features report that test flakiness is their top pain point. For ISVs, this translates directly into delayed releases and customer-facing incidents.

The better approach: make everything around the model boring and testable, and treat the model itself as the only fuzzy part.

Level 1: Unit Tests — Make Your Backend Deterministic

The first step is isolating as much non-AI logic as possible.

The pieces that belong here: context and state reducers (how you update internal state when a tool succeeds or fails), routing and classification logic (how you map user intents to flows), tool handlers (how you translate typed inputs into queries or API calls), and schemas and validation (how you validate model and tool I/O before it reaches business logic).

These components should be 100% deterministic. Given input X and state S, the reducer produces state S’. Given intent Y, the router selects flow F. Given tool input I, the handler calls service Z with parameters P.
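
Here is a minimal sketch of that idea in Python. The AgentState, ToolResult, and apply_tool_result names are illustrative placeholders, not part of any particular framework:

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ToolResult:
    tool: str
    ok: bool
    payload: dict

@dataclass(frozen=True)
class AgentState:
    tool_calls: int = 0
    facts: tuple = ()
    last_error: str | None = None

def apply_tool_result(state: AgentState, result: ToolResult) -> AgentState:
    # Pure function: same (state, result) in, same state out, no I/O.
    if result.ok:
        return replace(state, tool_calls=state.tool_calls + 1,
                       facts=state.facts + (result.payload,))
    return replace(state, tool_calls=state.tool_calls + 1,
                   last_error=f"{result.tool} failed")

def test_failed_tool_records_error_without_adding_facts():
    before = AgentState()
    after = apply_tool_result(before, ToolResult("sql_query", ok=False, payload={}))
    assert after.tool_calls == 1
    assert after.facts == ()
    assert after.last_error == "sql_query failed"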

For SaaS product teams, this is what keeps an “AI” system from becoming an unmaintainable black box: most of the code behaves like ordinary, unit-testable logic. The model is just one input source among many.

Level 2: Integration Tests With Fake Model Outputs

You don’t want every test run hitting a real LLM. It’s slow, expensive, and makes tests non-deterministic by design.

Instead: stub the model. Replace real model calls with a fake that returns predetermined, typed outputs. If you use LangChain, FakeChatModel handles this natively. Otherwise, a mock on your SDK client (unittest.mock in Python, vi.fn() in TypeScript) is sufficient.

Keep the orchestrator and tools real. The same code you run in production, driven by controlled “model decisions.”

This lets you test questions that actually matter for ISVs: if the model routes to a chart, do we call the right tools in the right order? If a tool fails with a structured error, does the orchestrator fall back correctly? Do we respect limits on tool calls and retries?

You’re not checking whether the answer reads well. You’re checking whether the system reacts correctly to a given sequence of model decisions.
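
Concretely, an integration-level test can look like the sketch below. The ScriptedModel, run_agent, and ToolError names, and the shape of the decision dicts, are our own stand-ins; in a real codebase run_agent would be your production orchestrator, with only the model call swapped out:

class ToolError(Exception):
    """Structured tool failure the orchestrator knows how to handle."""

class ScriptedModel:
    """Returns predetermined 'model decisions' in order, never calling a real LLM."""
    def __init__(self, decisions):
        self._decisions = iter(decisions)
    def invoke(self, prompt):
        return next(self._decisions)

def run_agent(question, model, tools, max_tool_calls=5):
    # Tiny stand-in for the production orchestrator loop, kept small so the sketch runs.
    fallback_used = False
    for _ in range(max_tool_calls):
        decision = model.invoke(question)
        if decision["action"] == "respond":
            return {"text": decision["text"], "fallback_used": fallback_used}
        try:
            tools[decision["tool"]](decision["args"])
        except ToolError:
            fallback_used = True
    return {"text": "Sorry, I could not finish that request.", "fallback_used": True}

def test_tool_failure_triggers_fallback():
    calls = []

    def failing_build_chart(args):
        calls.append(args)
        raise ToolError("upstream timeout")

    model = ScriptedModel([
        {"action": "call_tool", "tool": "build_chart", "args": {"metric": "churn"}},
        {"action": "respond", "text": "I couldn't build that chart, here is the underlying table instead."},
    ])
    result = run_agent("show churn by month", model, {"build_chart": failing_build_chart})

    assert len(calls) == 1                     # right tool, called exactly once
    assert result["fallback_used"] is True     # orchestrator absorbed the structured error
    assert "couldn't build" in result["text"]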

Level 3: Scenario Replays Against Real User Conversations

Unit and integration tests get you far. Real users will still find novel paths through your system.

Scenario replays close that gap. You record real interactions: user messages, context, tool invocations, and outcomes. You store them as anonymized scenarios. You replay them against new versions of your backend logic, new prompts or model providers, and new tools or safety policies.
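
A sketch of what the stored scenarios and the replay harness can look like. The JSON layout, the Scenario dataclass, and run_agent_on are placeholders we are inventing for illustration; the assertions compare structure (tools called, kind of answer), not exact wording:

import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Scenario:
    name: str
    user_message: str            # anonymized before it ever lands on disk
    context: dict                # tenant config, feature flags, permissions
    expected_tools: list         # tool sequence the known-good run produced
    expected_kind: str           # e.g. "chart", "table", "refusal"

def load_scenarios(folder: str):
    return [Scenario(**json.loads(p.read_text())) for p in sorted(Path(folder).glob("*.json"))]

def run_agent_on(message: str, context: dict) -> dict:
    # Placeholder: wire this to your real orchestrator entry point.
    raise NotImplementedError

def test_golden_scenarios_do_not_regress():
    for scenario in load_scenarios("scenarios/golden"):
        result = run_agent_on(scenario.user_message, scenario.context)
        assert result["tools_called"] == scenario.expected_tools, scenario.name
        assert result["kind"] == scenario.expected_kind, scenario.name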

Tools like LangSmith, Braintrust, or Langfuse handle this out of the box. If you’re starting from scratch, adopt one before building your own infrastructure.

Over time, this builds a catalog of “golden paths” and hard cases you can regression-test. For ISVs, this matters in a specific way: you can ask “does this change make anything worse for real customer queries?” instead of relying only on synthetic benchmarks written by your own team.

Guardrails Are Code, Not Prompts

Guardrails (limits on loops, retries, permissions, data access) are among the most important behaviors to test, and the most commonly mishandled.

The common mistake: putting guardrail logic in the prompt (“don’t call more than 5 tools”). Prompt instructions can be ignored, overridden by context, or simply forgotten across model versions.

The right approach: implement guardrails in normal code, with explicit conditions and side effects. The max tool call limit is a counter in your orchestrator loop. If toolCallCount >= MAX_TOOL_CALLS, the orchestrator raises a structured exception. A unit test drives it past that limit and asserts it stops cleanly.
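
In code, that guardrail is deliberately boring. A minimal sketch, with MaxToolCallsExceeded and the loop shape as illustrative placeholders:

import pytest

MAX_TOOL_CALLS = 5

class MaxToolCallsExceeded(Exception):
    """Structured signal the caller can turn into a clean user-facing message."""

def orchestrate(decide_next_step, execute_tool):
    # decide_next_step wraps the model; execute_tool wraps your tool registry.
    tool_call_count = 0
    while True:
        step = decide_next_step()
        if step["action"] == "respond":
            return step["text"]
        if tool_call_count >= MAX_TOOL_CALLS:
            raise MaxToolCallsExceeded(f"stopped after {tool_call_count} tool calls")
        execute_tool(step)
        tool_call_count += 1

def test_orchestrator_stops_at_tool_call_limit():
    looping_model = lambda: {"action": "call_tool", "tool": "metric_lookup", "args": {}}
    with pytest.raises(MaxToolCallsExceeded):
        orchestrate(looping_model, execute_tool=lambda step: None)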

This flips guardrails from “prompt hints” to enforceable policies. For SaaS product teams shipping customer-facing AI, this isn’t a nice-to-have. It’s the difference between a support ticket and a security incident.

Testing and Observability Feed Each Other

A testing strategy without observability is incomplete. Tests verify that the right events are emitted at the right time. Observability tells you where to add tests.

In practice: every request needs a trace identifier. Major decisions and errors should emit structured events, not free-text logs. Tests should assert that these events exist when expected.
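
A sketch of what that can look like with an in-memory event sink. The event names and the EventLog helper are illustrative; in production the same emit calls would go to your tracing backend:

import uuid

class EventLog:
    """In-memory sink for tests; swap for your real tracing backend in production."""
    def __init__(self):
        self.events = []

    def emit(self, trace_id: str, name: str, **attrs):
        # Structured event, not a free-text log line.
        self.events.append({"trace_id": trace_id, "event": name, **attrs})

def handle_question(question: str, log: EventLog) -> str:
    trace_id = uuid.uuid4().hex
    log.emit(trace_id, "request.received", question_length=len(question))
    # ... routing, tool calls, and guardrails would emit their own events here ...
    log.emit(trace_id, "response.sent", kind="chart")
    return trace_id

def test_request_emits_expected_events():
    log = EventLog()
    trace_id = handle_question("show MRR by month", log)
    names = [e["event"] for e in log.events if e["trace_id"] == trace_id]
    assert names == ["request.received", "response.sent"]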

Langfuse and LangSmith provide this out of the box, with trace IDs, structured event capture, and LLM-specific dashboards. Both integrate with OpenTelemetry if you already have APM tooling.

The result: you can look at a test failure and immediately see the associated trace. You can look at a production incident and quickly turn the exact failing trace into a new scenario replay.

What We’d Do Differently From Day One

Make most of the system non-AI. Move routing, state, validation, and tool logic into deterministic code you can unit-test normally.

Use fake models in integration tests. Drive your orchestrator and tools with predefined outputs to test flows without cost or flakiness.

Start capturing scenario replays early. Real conversations are your most valuable regression suite. LangSmith or Langfuse remove the need for custom infrastructure.

Treat guardrails as code. Implement limits and anti-loop logic in normal code and test them directly.

Tie tests to observability. Use trace IDs and structured events in both tests and production so you can move quickly from “what failed?” to “how do we reproduce it?”

You can’t make AI agents perfectly predictable. You can make the system around them predictable, debuggable, and safe to evolve. That’s what testing actually gives you.

If you’re evaluating how to embed a governed, testable AI analytics layer into your SaaS product, see how Toucan approaches AI agent reliability in production.

About the author
David Nowinsky is CTO at Toucan, an embedded AI analytics platform for ISVs and SaaS product teams. Toucan helps product teams ship customer-facing analytics without burning engineering sprints. Learn more at toucantoco.com.

