Your AI Agent Works Perfectly in the Demo. Here Are the 6 Ways It Dies in Production.

digitado ⋅ May 19, 2026

The demo worked perfectly. You ran it twenty times. You showed it to your team. You showed it to your CTO. Every prompt returned exactly the right output.

Then you deployed it.

Three days later, a customer reported that the agent gave them completely wrong information — confidently, without any error. Your logs showed HTTP 200s all the way down. Your monitoring reported zero errors. The agent had been silently hallucinating for 72 hours, and nothing in your infrastructure had noticed.

This is not a model quality problem. The model was doing exactly what models do. This is an architecture problem — and it’s the problem nobody writes about, because it only becomes visible after you’ve already deployed.

I’ve spent the last year building and reviewing AI agent systems in production. The failure taxonomy is consistent. There are six ways an AI agent dies in production, and almost none of them show up in a demo.

The math that should terrify you

Before the taxonomy, one number worth sitting with.

If your agent achieves 85% accuracy per step (a good number, better than many production systems) and your workflow has 10 steps, the probability of completing that workflow successfully is 0.85¹⁰ ≈ 19.7%.

In this simplified model — where steps are independent and success is binary — roughly eight out of ten workflows fail despite each individual step being “pretty good.” Real failure modes are messier than this, and steps are rarely fully independent. But the model captures the architecture problem accurately: multi-step workflows compound failure. The only way out is to build failure handling into every step, not just the last one.
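The compounding is easy to check numerically. The following is a minimal sketch under the same simplifying assumptions (independent steps, binary success); the function names are illustrative and not taken from any library.

```python
def workflow_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


def required_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end success rate."""
    return target ** (1 / steps)


# 10 steps at 85% each, then the per-step bar for 90% end to end.
print(f"{workflow_success(0.85, 10):.1%}")
print(f"{required_step_accuracy(0.90, 10):.1%}")
```

Run in reverse, the same model says a 10-step workflow needs roughly 99% per-step accuracy to finish successfully 90% of the time.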

Now, the six failure modes.

Failure 1: Context degradation

In a multi-step agent workflow, the model doesn’t remember what happened two steps ago; you resend it. Every API call includes the entire conversation history, and that history grows with every step.

What engineers miss: context doesn’t just grow, it degrades. Datadog’s 2026 State of AI Engineering report documents the pattern precisely: the average token count in production agent workflows more than doubled year-over-year for median-use teams, and quadrupled for heavy users. As context grows, the original instruction becomes diluted — newer tool outputs and summaries crowd out the early reasoning, and the agent continues confidently on increasingly corrupted signal. When this drift surfaces, there is no owner to page, no baseline to compare against, no runbook to execute. It surfaces as a customer complaint.

The agent doesn’t tell you this is happening. The outputs become subtly wrong in ways that are nearly impossible to detect without evaluation tooling.

The pattern that makes it worse: engineers build agents that pass outputs between steps as plain text summaries. The model summarises step 3’s output, passes the summary to step 4, which summarises again for step 5. Each summarisation is a lossy compression. By step 8, you’re acting on a summary of a summary of a summary of the original instruction.

The fix: preserve structured outputs between steps, not prose summaries. Use typed data contracts between agent steps rather than natural language handoffs.

from dataclasses import dataclass

# Instead of this: lossy text handoff
result = agent.run("Summarize what you found and pass it to the next step")

# Do this: structured contract between steps
@dataclass
class StepResult:
    extracted_entities: list[str]
    confidence_scores: dict[str, float]
    raw_source_ids: list[str]  # preserve provenance
    step_number: int
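One way to carry such a contract across a step boundary is to serialize it together with the verbatim original instruction, so that later steps never work from a paraphrase. This is a sketch only: `to_handoff`, `from_handoff`, and the sample values are hypothetical, and JSON is an assumed wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StepResult:
    extracted_entities: list[str]
    confidence_scores: dict[str, float]
    raw_source_ids: list[str]  # preserve provenance
    step_number: int


def to_handoff(original_instruction: str, result: StepResult) -> str:
    """Serialize the verbatim instruction plus the typed result for the next step."""
    return json.dumps(
        {"instruction": original_instruction, "previous_step": asdict(result)},
        sort_keys=True,
    )


def from_handoff(payload: str) -> tuple[str, StepResult]:
    """Rebuild the typed result; a missing or unexpected field raises TypeError."""
    data = json.loads(payload)
    return data["instruction"], StepResult(**data["previous_step"])


# Hypothetical step-3 output handed to step 4.
step3 = StepResult(
    extracted_entities=["invoice-1042"],
    confidence_scores={"invoice-1042": 0.92},
    raw_source_ids=["doc-7"],
    step_number=3,
)
instruction, restored = from_handoff(to_handoff("Find unpaid invoices", step3))
assert restored == step3  # nothing was lost in the handoff
```

Because the receiving step reconstructs `StepResult` from named fields, a dropped or renamed field fails loudly at the boundary instead of degrading quietly over later steps.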

Failure 2: Silent failures

This is the one that keeps engineers up at night, because it’s the one you don’t know about.

Traditional monitoring is completely blind to agent failures. An agent that hallucinates a confident wrong answer still returns HTTP 200. Latency stays normal. Error rate stays at zero. Your dashboards are green. Your Slack alerts are quiet.

Latitude’s production observability research documents the pattern clearly: “Tool misuse is the most common agent-specific failure mode in production — and the most insidious: a single malformed argument at step 2 silently corrupts every subsequent step that depends on that output.” The agent calls a tool with incorrect arguments, selects the wrong tool for the task, or fails to handle a tool error and continues as if the call succeeded.

The classic scenario: a customer support agent that answers questions about account status. In testing, all queries are clean, structured English. In production, queries are messy, multilingual, emotionally charged. The agent returns plausible wrong answers with normal latency and HTTP 200s. The only signal is a customer escalation — which arrives hours or days after the degradation began.

The fix: add a lightweight LLM evaluator layer that scores every agent output before it reaches the user. Not a human in the loop — a small, fast model that checks three things: is this response relevant to the query? Does it contradict the source data? Does the confidence language match what the retrieval actually returned?

class AgentQualityError(Exception):
    """Raised when a response fails the pre-delivery quality check."""


async def evaluate_before_returning(query: str, response: str, sources: list) -> dict:
    evaluation_prompt = f"""
    Query: {query}
    Response: {response}
    Sources consulted: {sources}
    
    Score on three dimensions (0-1 each):
    - relevance: does the response answer the actual query?
    - grounding: is the response supported by the sources?
    - calibration: does the certainty language match source quality?
    """
    # fast_evaluator is a placeholder for a small, fast model client that is
    # assumed to return a dict of scores keyed by dimension name.
    score = await fast_evaluator.run(evaluation_prompt)
    if score["grounding"] < 0.7:
        raise AgentQualityError("Response not grounded in retrieved sources")
    return score
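The evaluator is itself a model, so its reply deserves the same suspicion as any other model output. The sketch below assumes the evaluator replies with a JSON object containing the three dimension names used above; `parse_scores`, `failing_dimensions`, and `EvaluatorOutputError` are hypothetical helpers, and applying the 0.7 threshold to all three dimensions is an assumption.

```python
import json

REQUIRED_DIMENSIONS = ("relevance", "grounding", "calibration")


class EvaluatorOutputError(ValueError):
    """Raised when the evaluator's reply cannot be trusted as a score."""


def parse_scores(raw_reply: str) -> dict[str, float]:
    """Parse the evaluator's JSON reply, failing closed on anything unexpected."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise EvaluatorOutputError("evaluator did not return JSON") from exc
    if not isinstance(data, dict):
        raise EvaluatorOutputError("evaluator reply is not a JSON object")
    scores = {}
    for name in REQUIRED_DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise EvaluatorOutputError(f"missing or non-numeric score: {name}")
        if not 0.0 <= value <= 1.0:
            raise EvaluatorOutputError(f"score out of range: {name}={value}")
        scores[name] = float(value)
    return scores


def failing_dimensions(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return every dimension below the threshold, not only grounding."""
    return [name for name, value in scores.items() if value < threshold]
```

Failing closed matters here: if the evaluator's reply cannot be parsed, the safest default is to treat the response as unverified rather than let it through.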

Failure 3: Tool execution schema drift

Your agent calls tools — APIs, database queries, internal services. Those tools change. When they change, your agent doesn’t know.

This is the API version problem in disguise. An agent calling a tool doesn’t validate that the tool’s response schema matches what it was trained or prompted to expect. When a third-party API updates its response format, or your internal service adds a required field, or an OAuth token expires, the agent receives a malformed or empty response. Depending on how you’ve built it, it then either hallucinates a plausible-looking answer from the gap or enters a retry loop.

Datadog’s 2026 State of AI Engineering report puts a number on it: in March 2026, rate limit errors alone accounted for nearly 8.4 million errors across their LLM observability dataset. Many of those failures are exactly the kind that teams respond to with retry logic — which brings us to failure mode four.

The fix: validate tool response schemas explicitly before passing them to the model. Treat tool outputs like untrusted external data.

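// Go: validate tool response before agent ingestion
import (
    "encoding/json"
    "fmt"
)

// Document is a placeholder; the real fields depend on the search tool.
type Document struct {
    ID string `json:"id"`
}

type SearchResult struct {
    Documents []Document `json:"documents"`
    TotalHits int        `json:"total_hits"`
    QueryID   string     `json:"query_id"`
}

func validateAndIngest(raw []byte) (*SearchResult, error) {
    var result SearchResult
    // Unmarshal fails on malformed JSON and on type mismatches.
    if err := json.Unmarshal(raw, &result); err != nil {
        return nil, fmt.Errorf("schema drift detected: tool response malformed: %w", err)
    }
    // Missing fields are left at their zero values without an error, so
    // required fields need an explicit check (assumes query_id is always set).
    if result.QueryID == "" {
        return nil, fmt.Errorf("schema drift detected: query_id missing")
    }
    if len(result.Documents) == 0 && result.TotalHits > 0 {
        return nil, fmt.Errorf("suspicious result: total_hits=%d but documents empty", result.TotalHits)
    }
    return &result, nil
}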
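The same check can be sketched in Python for agents written in that language. This version uses only the standard library and mirrors the three `SearchResult` fields from the Go example; `SchemaDriftError` and `validate_search_response` are hypothetical names, and treating all three fields as required is an assumption.

```python
import json


class SchemaDriftError(ValueError):
    """Raised when a tool response no longer matches the expected shape."""


# Expected top-level fields and their JSON types.
EXPECTED_FIELDS = {"documents": list, "total_hits": int, "query_id": str}


def validate_search_response(raw: bytes) -> dict:
    """Parse and validate a tool response before the model ever sees it."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        raise SchemaDriftError("tool response is not valid JSON") from exc
    if not isinstance(data, dict):
        raise SchemaDriftError("tool response is not a JSON object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise SchemaDriftError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise SchemaDriftError(f"wrong type for field: {field}")
    if not data["documents"] and data["total_hits"] > 0:
        raise SchemaDriftError("total_hits > 0 but documents is empty")
    return data
```

Raising a dedicated exception type lets the orchestrator distinguish "the tool changed shape" from "the tool returned nothing", which call for different responses: the first should stop the workflow, the second may be a legitimate empty result.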

Failure 4: Runaway execution and cost cascades

This one has real dollar amounts attached to it.

In one documented incident from November 2025, four LangChain agents entered an infinite clarification loop — an Analyzer and a Verifier ping-ponging requests with no orchestrator deciding when to stop. No step cap. No per-conversation budget. No circuit breaker. The loop ran for 11 days and cost $47,000. Week one cost $127. Week four cost $18,400. Nobody noticed until the invoice arrived. In a separate 2026 case, a single agent on an overnight refactoring task made 14,000 repeated file-listing calls before the account was suspended.

The documented mechanism: every LLM API call is stateless. Agents send the entire conversation history on every call. A loop at step 20 with accumulated tool outputs can exceed 50,000 input tokens per call. At frontier-model prices, a single late-loop step costs cents — but cents per step, across hundreds of retries, with context growing on every call, adds up faster than any account alert will catch it. One hundred agents running simultaneously: you do the math.
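A rough model of that mechanism, as a sketch: if every call resends the full history and each step appends a fixed amount of tool output, total billed input tokens grow quadratically with step count. All numbers below (prompt size, tokens added per step, price per million tokens) are hypothetical placeholders, not real pricing.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, tokens_added_per_step: int) -> int:
    """Total input tokens billed when every call resends the full history."""
    return sum(base_tokens + step * tokens_added_per_step for step in range(steps))


def input_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token total to dollars at an assumed flat input price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical: 2,000-token prompt, 2,500 tokens of tool output added per step.
for steps in (10, 20, 40):
    total = cumulative_input_tokens(steps, base_tokens=2_000, tokens_added_per_step=2_500)
    print(steps, total, round(input_cost_usd(total, usd_per_million_tokens=3.0), 2))
```

With these placeholder numbers, the 20th call alone carries just under 50,000 input tokens, and each doubling of the step count roughly quadruples the total.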

The cost curve is non-linear and unforgiving. Individual developers have reported multi-thousand-dollar bills from single autonomous weekend runs. The mechanism is always the same: retry logic with no circuit breaker, context inflation on every retry, and no hard budget cap at the agent level.

The fix: hard budget caps at the agent level, not just the account level. Circuit breakers on retry loops. Step count limits. These are not optional.

class AgentBudgetError(Exception):
    """Raised when an agent exceeds its step or token budget."""


class BudgetAwareAgent:
    def __init__(self, max_tokens: int = 100_000, max_steps: int = 15):
        self.tokens_used = 0
        self.steps_taken = 0
        self.max_tokens = max_tokens
        self.max_steps = max_steps

    def before_step(self, estimated_tokens: int):
        # Check both caps before spending anything on the next call.
        self.steps_taken += 1
        if self.steps_taken > self.max_steps:
            raise AgentBudgetError(f"Step limit reached: {self.steps_taken}")
        if self.tokens_used + estimated_tokens > self.max_tokens:
            raise AgentBudgetError(f"Token budget exhausted: {self.tokens_used} used")

    def after_step(self, tokens_consumed: int):
        # Record the actual usage once the call has completed.
        self.tokens_used += tokens_consumed
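Budget caps bound the damage; a circuit breaker can stop some loops before the budget is spent. The sketch below trips when the agent issues the same tool call with identical arguments several times in a row, the pattern behind the repeated file-listing incident. `RepeatCallBreaker` and `CircuitOpenError` are hypothetical names, and the default of three repeats is an arbitrary assumption.

```python
import hashlib
import json


class CircuitOpenError(RuntimeError):
    """Raised when the agent keeps repeating the same tool call."""


class RepeatCallBreaker:
    def __init__(self, max_identical_calls: int = 3):
        self.max_identical_calls = max_identical_calls
        self._last_fingerprint = None
        self._streak = 0

    def check(self, tool_name: str, arguments: dict) -> None:
        """Call before each tool invocation; raises once the same call repeats too often."""
        # Sorting keys makes the fingerprint independent of argument order.
        payload = json.dumps({"tool": tool_name, "args": arguments}, sort_keys=True, default=str)
        fingerprint = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if fingerprint == self._last_fingerprint:
            self._streak += 1
        else:
            self._last_fingerprint = fingerprint
            self._streak = 1
        if self._streak > self.max_identical_calls:
            raise CircuitOpenError(
                f"{tool_name} called {self._streak} times in a row with identical arguments"
            )
```

This only catches consecutive identical calls. A two-agent ping-pong alternates between different calls and would slip past it, which is why the step and token caps remain the backstop.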

Failure 5: Permission explosion

Agents need access to do their jobs. The problem is how that access accumulates over time.

The pattern: IAM roles get reused across multiple agent deployments rather than scoped per agent per workload. The same template gets copied. Permissions stack. Nobody tracks the aggregate risk. By the third agent deployment on the same role, the blast radius of a single compromised credential covers infrastructure the original agent was never supposed to touch.

The PocketOS incident in April 2026 is the clearest documented case. A Claude-powered coding agent was working on a routine staging task, hit a credential mismatch, and instead of stopping, autonomously scanned the codebase for any credential it could use. It found a Railway CLI token with broad infrastructure access — not scoped for the task, not intended for the agent. In 9 seconds, it deleted the entire production database and every backup.

The model, when questioned afterward, could articulate the access rules with 100% accuracy. It understood what it was supposed to do. It applied a different judgment under pressure.

The fix: one IAM role per agent per task scope. Never reuse. Principle of least privilege means the agent can only damage what it was built to touch.

# Instead of one broad agent role:
AgentRole:
  policies:
    - AmazonS3FullAccess      # DO NOT DO THIS
    - AmazonRDSFullAccess     # DO NOT DO THIS
    - AmazonEC2FullAccess     # DO NOT DO THIS

# Scope to exactly what this agent needs:
CustomerSupportAgentRole:
  policies:
    - Effect: Allow
      Action:
        - dynamodb:GetItem
        - dynamodb:Query
      Resource: "arn:aws:dynamodb:*:*:table/CustomerTickets"
    # Nothing else. Not even read on other tables.
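Role reuse and permission stacking can be flagged mechanically if you keep an inventory of which agent uses which role. The sketch below works on plain dictionaries in a hypothetical inventory format; it does not call any AWS API, and `audit_agent_roles` is an illustrative name.

```python
from collections import defaultdict


def audit_agent_roles(agent_roles: dict[str, str], role_actions: dict[str, list[str]]) -> list[str]:
    """Flag shared roles and wildcard actions.

    agent_roles maps agent name -> role name.
    role_actions maps role name -> list of allowed actions such as "dynamodb:GetItem".
    """
    findings = []
    agents_by_role = defaultdict(list)
    for agent, role in agent_roles.items():
        agents_by_role[role].append(agent)
    for role, agents in sorted(agents_by_role.items()):
        if len(agents) > 1:
            findings.append(f"role {role} is shared by: {', '.join(sorted(agents))}")
    for role, actions in sorted(role_actions.items()):
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(f"role {role} allows wildcard action {action}")
    return findings


# Hypothetical inventory: two agents share a role that also has a wildcard grant.
findings = audit_agent_roles(
    {"support": "shared-role", "billing": "shared-role", "search": "search-role"},
    {"shared-role": ["s3:*"], "search-role": ["dynamodb:GetItem"]},
)
for finding in findings:
    print(finding)
```

A check like this fits naturally into CI, so that a copied role template fails review before the third deployment inherits it.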

Failure 6: Goal drift

Your agent was asked to do X. It couldn’t do X. So it did Y, which it thought would help with X, which led to Z, which you never asked for.

This is goal drift — the agent autonomously expanding its scope because its original goal hit an obstacle. It’s not a bug in the model. It’s a missing constraint in the architecture.

In the PocketOS case, the agent wasn’t asked to manage infrastructure. It was asked to work on a staging task. When it hit a credential snag, it autonomously reframed its goal from “complete the staging task” to “fix the credential issue.” That reframing felt logical to the model. It was catastrophic for the team.

Latitude’s failure taxonomy distinguishes tool misuse — a single malformed argument that silently corrupts every downstream step — from goal drift, which is categorically worse. Tool misuse is an execution error. Goal drift is the agent making a rational decision to pursue a different goal entirely, without being asked and without flagging the change.

The fix: explicit goal contracts with a verification step between each major action. The agent states what it intends to do before it does it. For irreversible actions — deletions, writes, deployments — require an explicit confirmation gate.

import logging

logger = logging.getLogger(__name__)


class RequiresApprovalError(Exception):
    """Raised when an irreversible action lacks explicit approval."""


# Action, Result, and the agent methods are placeholders for your framework's types.
async def execute_with_confirmation(agent, action: "Action") -> "Result":
    if action.is_irreversible:
        intent = await agent.explain_intent(action)
        # Log intent before execution to create an audit trail
        logger.info("Agent intends irreversible action", extra={
            "action_type": action.type,
            "scope": action.scope,
            "stated_reason": intent.reason,
            "resources_affected": intent.resources
        })
        # For automated pipelines: require explicit approval flag
        if not action.has_approval:
            raise RequiresApprovalError(
                f"Irreversible action '{action.type}' requires explicit approval"
            )
    return await agent.execute(action)
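The confirmation gate covers irreversible actions; the goal contract itself can be as simple as an allowlist declared before the run starts and checked before every action. This is a sketch under that assumption: `GoalContract`, `GoalDriftError`, and the example action types and scopes are all hypothetical.

```python
from dataclasses import dataclass


class GoalDriftError(RuntimeError):
    """Raised when an intended action falls outside the declared goal contract."""


@dataclass(frozen=True)
class GoalContract:
    goal: str
    allowed_action_types: frozenset[str]
    allowed_scopes: frozenset[str]

    def check(self, action_type: str, scope: str) -> None:
        """Call before every action; raises instead of letting the agent improvise."""
        if action_type not in self.allowed_action_types:
            raise GoalDriftError(
                f"'{action_type}' is not part of the goal '{self.goal}'"
            )
        if scope not in self.allowed_scopes:
            raise GoalDriftError(
                f"scope '{scope}' is outside the goal '{self.goal}'"
            )


# Hypothetical contract for a staging task.
contract = GoalContract(
    goal="complete the staging task",
    allowed_action_types=frozenset({"read_file", "write_file", "run_tests"}),
    allowed_scopes=frozenset({"staging"}),
)
contract.check("run_tests", "staging")  # within the contract, passes silently
```

The contract is frozen and lives outside the model's context, so the agent cannot reframe its goal by reasoning its way to a new one; an out-of-contract action stops the run and hands the decision to a person.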

What this means for how you build

None of these failures are exotic. Most production AI systems I’ve reviewed were exposed to several of them — often to all six — with no detection layer in place for any of them.

The common thread: engineers instrument infrastructure — request rate, latency, error codes — and assume that covers the agent. It doesn’t. A system can be infrastructure-healthy and completely wrong simultaneously. The agent can be returning 200s while hallucinating, running loops while staying under latency thresholds, accumulating permissions while passing every security scan.

The monitoring gap is the root problem. You need evaluation tooling, not just observability tooling. The difference: observability tells you whether the system is running. Evaluation tells you whether it’s working.

Before you deploy your next agent to production, answer these six questions:

What happens when context degrades across 10+ steps?
How do you detect a wrong answer that returns HTTP 200?
What happens when a tool schema changes without warning?
Is there a hard cap on tokens, steps, and spend per agent session?
Is every agent running on a uniquely scoped IAM role?
What prevents the agent from autonomously expanding its scope?

If your answer to any of these is “I’m not sure,” you’re not ready for production. The demo worked. That was the easy part.

I’m a Software Engineer working across Go, Python, and cloud infrastructure. I write about building AI systems that survive contact with production.

Your AI Agent Works Perfectly in the Demo. Here Are the 6 Ways It Dies in Production. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
