Your AI Agent Works Perfectly in the Demo. Here Are the 6 Ways It Dies in Production.
Author(s): Vinamra Yadav
Originally published on Towards AI.

The demo worked perfectly. You ran it twenty times. You showed it to your team. You showed it to your CTO. Every prompt returned exactly the right output. Then you deployed it.

Three days later, a customer reported that the agent gave them completely wrong information, confidently and without any error. Your logs showed HTTP 200s all the way down. Your monitoring reported zero errors. The agent had been silently hallucinating for 72 hours, and nothing in your infrastructure had noticed.

This is not a model quality problem. The model was doing exactly what models do. This is an architecture problem, and it’s the problem nobody writes about, because it only becomes visible after you’ve already deployed.

I’ve spent the last year building and reviewing AI agent systems in production. The failure taxonomy is consistent. There are six ways an AI agent dies in production, and almost none of them show up in a demo.

The math that should terrify you

Before the taxonomy, one number worth sitting with. If your agent achieves 85% accuracy per step (a good number, better than many production systems) and your workflow has 10 steps, the probability of completing that workflow successfully is 0.85¹⁰ ≈ 19.7%.

In this simplified model, where steps are independent and success is binary, roughly eight out of ten workflows fail despite each individual step being “pretty good.” Real failure modes are messier than this, and steps are rarely fully independent. But the model captures the architecture problem accurately: multi-step workflows compound failure. The only way out is to build failure handling into every step, not just the last one.

Now, the six failure modes.

Failure 1: Context degradation

In a multi-step agent workflow, the model doesn’t remember what happened two steps ago; you resend it. Every API call includes the entire conversation history, and that history grows with every step.
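To see how quickly the resent history outgrows the original instruction, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Conversation` class, the `simulate_workflow` helper, the rough four-characters-per-token estimate, and the 2,000-character tool output per step. It is not any provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Accumulates every message that has to be resent on each model call."""
    messages: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def approx_tokens(self) -> int:
        # Crude assumption: about 4 characters per token.
        # Use your provider's tokenizer for real counts.
        return sum(len(m["content"]) for m in self.messages) // 4


def simulate_workflow(steps: int, tool_output_chars: int = 2000) -> list[int]:
    """Return the approximate prompt size sent at each step of a hypothetical workflow."""
    convo = Conversation()
    convo.add("system", "You are a support agent. Follow the refund policy exactly.")
    sizes = []
    for _ in range(steps):
        # The full history so far is what gets sent for this step.
        sizes.append(convo.approx_tokens())
        # Each step appends a tool result, so the next call is larger.
        convo.add("tool", "x" * tool_output_chars)
    return sizes


sizes = simulate_workflow(steps=10)
# Fraction of the step-10 prompt that is still the original instruction.
share = sizes[0] / sizes[-1]
print(sizes)
print(f"instruction share at step 10: {share:.2%}")
```

Under these assumed numbers, the original instruction is well under 1% of the prompt by step 10. The exact figure is beside the point. What matters is the shape: the instruction stays the same size while everything around it grows.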
What engineers miss: context doesn’t just grow, it degrades. Datadog’s 2026 State of AI Engineering report documents the pattern precisely: the average token count in production agent workflows more than doubled year-over-year for median-use teams, and quadrupled for heavy users. As context grows, the original instruction becomes diluted. Newer tool outputs and summaries crowd out the early reasoning, and the agent continues confidently on an increasingly corrupted signal. When this drift surfaces, there is no owner to page, no baseline to compare against, no runbook to execute. It surfaces as a customer complaint.

The agent doesn’t tell you this is happening. The outputs become subtly wrong in ways that are nearly impossible to detect without evaluation tooling.

The pattern that makes it worse: engineers build agents that pass outputs between steps as plain text summaries. The model summarises step 3’s output, passes the summary to step 4, which summarises again for step 5. Each summarisation is a lossy compression. By step 8, you’re acting on a summary of a summary of a summary of the original instruction.

The fix: preserve structured outputs between steps, not prose summaries. Use typed data contracts between agent steps rather than natural language handoffs.

```python
from dataclasses import dataclass

# Instead of this: a lossy text handoff
# (`agent` stands for whatever agent object your framework provides)
# result = agent.run("Summarize what you found and pass it to the next step")

# Do this: a structured contract between steps
@dataclass
class StepResult:
    extracted_entities: list[str]
    confidence_scores: dict[str, float]
    raw_source_ids: list[str]  # preserve provenance
    step_number: int
```

Failure 2: Silent failures

This is the one that keeps engineers up at night, because it’s the one you don’t know about.

Traditional monitoring is completely blind to agent failures. An agent that hallucinates a confident wrong answer still returns HTTP 200. Latency stays normal. Error rate stays at zero. Your dashboards are green. Your Slack alerts are quiet.
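A small sketch makes the blind spot concrete. The `AgentCall` record, the `transport_healthy` check, and the two sample answers are all hypothetical. They stand in for the status-and-latency signals a conventional dashboard typically watches.

```python
from dataclasses import dataclass


@dataclass
class AgentCall:
    status_code: int
    latency_ms: float
    answer: str


def transport_healthy(call: AgentCall, max_latency_ms: float = 2000.0) -> bool:
    """The kind of check a conventional dashboard encodes: status and latency only."""
    return call.status_code == 200 and call.latency_ms <= max_latency_ms


# Hypothetical ground truth for this example: the account is suspended.
correct = AgentCall(200, 840.0, "Your account is suspended pending review.")
hallucinated = AgentCall(200, 815.0, "Your account is active and in good standing.")

# Both calls look identical to transport-level monitoring.
both_green = transport_healthy(correct) and transport_healthy(hallucinated)
print(both_green)
```

Nothing in `transport_healthy` ever reads `answer`, so no threshold tuning can make it catch the second call. The signal has to come from something that inspects the content.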
Latitude’s production observability research documents the pattern clearly: “Tool misuse is the most common agent-specific failure mode in production, and the most insidious: a single malformed argument at step 2 silently corrupts every subsequent step that depends on that output.” The agent calls a tool with incorrect arguments, selects the wrong tool for the task, or fails to handle a tool error and continues as if the call succeeded.

The classic scenario: a customer support agent that answers questions about account status. In testing, all queries are clean, structured English. In production, queries are messy, multilingual, emotionally charged. The agent returns plausible wrong answers with normal latency and HTTP 200s. The only signal is a customer escalation, which arrives hours or days after the degradation began.

The fix: add a lightweight LLM evaluator layer that scores every agent output before it reaches the user. Not a human in the loop, but a small, fast model that checks three things: is this response relevant to the query? Does it contradict the source data? Does the confidence language match what the retrieval actually returned?

```python
class AgentQualityError(Exception):
    """Raised when a response fails the evaluator's grounding check."""


async def evaluate_before_returning(query: str, response: str, sources: list) -> dict:
    evaluation_prompt = f"""
    Query: {query}
    Response: {response}
    Sources consulted: {sources}

    Score on three dimensions (0-1 each):
    - relevance: does the response answer the actual query?
    - grounding: is the response supported by the sources?
    - calibration: does the certainty language match source quality?
    """
    # `fast_evaluator` is assumed to be a small, fast model client defined elsewhere
    # whose run() returns a dict of scores keyed by dimension name.
    score = await fast_evaluator.run(evaluation_prompt)
    if score["grounding"] < 0.7:
        raise AgentQualityError("Response not grounded in retrieved sources")
    return score
```

Failure 3: Tool execution schema drift

Your agent calls tools: APIs, database queries, internal services. Those tools change. When they change, your agent doesn’t know. This is the API version problem in disguise.
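One way to make the agent notice is to check every tool response against the shape it was built for before the model sees it. The sketch below is a minimal illustration. The `ACCOUNT_STATUS_SCHEMA` contract, its field names, and the `validate_tool_response` helper are assumptions invented for this example, not part of any particular agent framework.

```python
from typing import Any


class ToolSchemaError(Exception):
    """Raised when a tool response does not match the shape the agent was built for."""


# Hypothetical contract for an account-status tool: field name -> expected type.
ACCOUNT_STATUS_SCHEMA: dict[str, type] = {
    "account_id": str,
    "status": str,
    "balance_cents": int,
}


def validate_tool_response(payload: Any, schema: dict[str, type]) -> dict:
    """Fail loudly on missing or mistyped fields instead of passing them to the model."""
    if not isinstance(payload, dict):
        raise ToolSchemaError(f"expected an object, got {type(payload).__name__}")
    problems = []
    for name, expected in schema.items():
        if name not in payload:
            problems.append(f"missing field '{name}'")
        elif not isinstance(payload[name], expected):
            actual = type(payload[name]).__name__
            problems.append(f"field '{name}' is {actual}, expected {expected.__name__}")
    if problems:
        raise ToolSchemaError("; ".join(problems))
    return payload


# A response in the expected shape passes through unchanged.
ok = validate_tool_response(
    {"account_id": "a-1", "status": "active", "balance_cents": 1250},
    ACCOUNT_STATUS_SCHEMA,
)

# A renamed field (status -> state) is caught before the agent can improvise around it.
try:
    validate_tool_response(
        {"account_id": "a-1", "state": "active", "balance_cents": 1250},
        ACCOUNT_STATUS_SCHEMA,
    )
    drift_detected = False
except ToolSchemaError as exc:
    drift_detected = True
    drift_message = str(exc)
```

The design choice here is to raise rather than return a partial result: an exception shows up in error rates and alerts, while a quietly missing field does not. In a real system you would more likely use a schema library than a hand-rolled type map.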
An agent calling a tool doesn’t validate that the tool’s response schema matches what it was trained or prompted to expect. When a third-party API updates its response format, your internal service adds a required field, or an OAuth token expires, the agent receives a malformed or empty response. Depending on how you’ve built it, it then either hallucinates a plausible-looking answer from the gap or enters a retry loop. Datadog’s 2026 State of AI Engineering report puts a number on it: […]