LLM Evals Are Not Enough: The Missing CI Layer Nobody Talks About

For a while, “just add evals” felt like the obvious answer to shipping LLM systems responsibly.

That made sense. OpenAI describes evals as a way to make LLM applications more stable and resilient to code and model changes, while Anthropic frames evals as the mechanism for testing whether an AI system succeeds on a task at all. At the same time, tools like Promptfoo and DeepEval have pushed evaluations closer to the software engineering mainstream by explicitly supporting CI/CD workflows.

So, on paper, the problem looked solved: you run an eval suite, get scores, and check whether the model is good enough. In practice, that is not what happened.

What I keep seeing is that teams are getting better at producing eval results, but not necessarily better at making build decisions from them, and those are not the same thing. An eval run can tell you a lot: pass rates, metric scores, safety findings, per-test detail. But CI does not care how informative your dashboard is. CI needs something much harsher: a deterministic answer to a narrow question, like “should this build pass?”. That sounds trivial until you try to operationalize it.

The first problem is that evals are not naturally shaped like policy, but rather like measurements. A tool will tell you that one suite scored 0.84 on “groundedness”, another had a 92% pass rate, and a third failed on a handful of long-tail edge cases. Which is useful, of course, but what does your organization actually require? Is the 0.84 acceptable? Is the 92% acceptable on all suites, or only on non-safety suites? Is a one-point regression acceptable if the variance is normal, but unacceptable if it affects hallucination checks? The moment you ask those questions, you are no longer talking about “running evals.” You are talking about governance. That is where things start getting messy.

The second problem is that most real teams do not live in one clean eval ecosystem. OpenAI and Anthropic both emphasize that evaluations should be ongoing and tied to real production behavior, which is sensible, but teams rarely standardize perfectly around a single framework for long. Different groups use different tooling, different metrics, different naming conventions, and different report formats. Promptfoo supports CI for prompt and security testing. DeepEval supports CI for unit-style and end-to-end LLM testing. Both are valid and both solve real problems, but once multiple ecosystems exist inside one company, you inherit a new class of problem above the tool level: how to apply one consistent release standard across all of them.
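To make that concrete, one approach, offered as a sketch rather than a recipe, is to normalize every framework’s output into a single small record before any release logic sees it. The field names and the summary shape below are invented for illustration; they are not Promptfoo’s or DeepEval’s actual output formats, each of which would need its own small adapter.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One normalized measurement, whichever framework produced it."""
    suite: str       # e.g. "rag-groundedness" or "safety-redteam"
    metric: str      # e.g. "pass_rate" or "groundedness"
    value: float     # normalized to the 0.0-1.0 range
    blocking: bool   # does the organization treat this suite as release-blocking?

def from_generic_summary(raw: dict, suite: str, blocking: bool) -> EvalResult:
    """Adapter for an invented summary shape like {"passed": 46, "total": 50}.
    Each real framework would get its own adapter that ends in an EvalResult."""
    rate = raw["passed"] / raw["total"] if raw["total"] else 0.0
    return EvalResult(suite=suite, metric="pass_rate", value=rate, blocking=blocking)
```

Once everything downstream only ever sees EvalResult, the release standard no longer has to know which tool produced the numbers.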

That gap gets underestimated because dashboards make everything look more mature than it is. A dashboard can show trend lines, pretty summaries, and lots of reassuring numbers, but a merge pipeline is unforgiving. It has to cope with malformed result files, missing metrics, renamed tests, empty selections, bad regex filters, broken baselines, and the edge cases that appear when someone defines a relative regression check carelessly. In other words, the actual operational problem is not “can we evaluate this model?” The real problem is “can we trust the evaluation artifacts enough to let them block releases without creating chaos?”, and those are very different standards.
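In practice, “trusting the artifacts” mostly means refusing to guess. Here is a minimal sketch of defensive loading, assuming an invented flat JSON artifact that maps metric names to scores: every failure mode listed above becomes an explicit error that CI can see, never a silent pass.

```python
import json

class GateError(Exception):
    """Raised when the gate cannot trust its inputs; CI should fail, not skip."""

def load_metric(path: str, metric: str) -> float:
    """Read one named metric from an eval artifact, failing loudly on a
    missing file, malformed JSON, a missing metric, or a non-numeric value."""
    try:
        with open(path) as f:
            raw = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        raise GateError(f"cannot read eval artifact {path!r}: {exc}") from exc
    if not isinstance(raw, dict) or metric not in raw:
        raise GateError(f"{path!r} has no metric named {metric!r}")
    value = raw[metric]
    if not isinstance(value, (int, float)) or isinstance(value, bool):
        raise GateError(f"{metric!r} in {path!r} is not numeric: {value!r}")
    return float(value)
```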

I think this is why so many AI quality systems look good in demos but feel brittle in production. The demo assumes the eval ran correctly, the schema is stable, the metrics are present, and everyone agrees on what counts as failure. Production gives you the opposite. Someone changes a metric name, or uploads JSON in a slightly different shape. A baseline is missing, a test subset returns no rows, and nobody agrees whether that should pass silently or fail loudly. Suddenly, your “AI quality gate” is really just a collection of fragile assumptions taped to a build script.

If you make the gate too loose, regressions slip through. A model gets less grounded, less safe, or less consistent, but the build still goes green because nobody translated the organization’s standards into machine-enforceable rules. If you make the gate too brittle, engineering loses trust in it. People start bypassing the checks because the system blocks releases for configuration problems rather than real quality problems. Once that happens, the whole promise of eval-driven development starts to erode.

OpenAI’s recent writing on evals for agents makes the broader point clearly: the goal is to turn skills into something you can test, score, and improve over time. Anthropic makes a similar case that automated evals should support development before real users are exposed. I agree with both, but it feels like there is a missing middle layer between “we have test results” and “we can safely use them in release engineering.”

That missing layer is policy, not more evals, another benchmark, or another scorecard. By policy, I mean explicit, reviewable rules that say things like: this suite must stay above a given pass rate, this metric must not regress more than a defined amount from the baseline, these tags are advisory, and those tags are blocking. Once you think in those terms, the architecture becomes clearer: the evaluation framework generates evidence, and a separate policy layer interprets that evidence for CI.
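As a sketch of what those rules might look like once they are versioned data rather than tribal knowledge (the field names here are mine, not any particular framework’s), a rule can carry an absolute floor, a bounded regression against a baseline, and an advisory-versus-blocking flag:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    suite: str
    metric: str
    min_value: Optional[float] = None       # absolute floor, e.g. pass rate >= 0.92
    max_regression: Optional[float] = None  # allowed drop vs. baseline, e.g. 0.01
    blocking: bool = True                   # False means advisory: warn, do not fail

def check(rule: Rule, current: float, baseline: Optional[float]) -> tuple[bool, str]:
    """Evaluate one rule against one measurement; a missing baseline is an
    explicit failure for regression rules, not a silent pass."""
    if rule.min_value is not None and current < rule.min_value:
        return False, f"{rule.suite}/{rule.metric}={current:.3f} is below the floor {rule.min_value}"
    if rule.max_regression is not None:
        if baseline is None:
            return False, f"{rule.suite}/{rule.metric}: baseline missing, cannot check regression"
        if baseline - current > rule.max_regression:
            return False, f"{rule.suite}/{rule.metric} regressed {baseline - current:.3f} vs. baseline"
    return True, f"{rule.suite}/{rule.metric} within policy"
```

For example, with a rule that allows a 0.01 regression against a baseline of 0.84, a current score of 0.83 passes and 0.80 does not.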

That separation matters more than it first appears. It means teams can change models, prompts, and eval providers without rewriting their release standards every time. It means quality requirements stop living as tribal knowledge in Slack threads and start existing as versioned policy. It means warnings and errors can be treated differently on purpose rather than accidentally, and build decisions become auditable, which is exactly what you want when an LLM system starts affecting customer-facing behavior, internal automation, or safety-sensitive workflows.

The real missing layer is small and operational: a way to take whatever eval results already exist, interpret them through explicit rules, and give CI a fail-safe answer. Because once you are shipping LLM systems seriously, “we ran evals” is not enough; you need to know what happens next.
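In code terms, that “what happens next” can be as small as an exit-code contract. A sketch, assuming a hypothetical run_gate() that loads the artifacts and applies rules like the ones above: CI goes green only when every blocking rule passes, and any unexpected condition fails the build instead of quietly skipping it.

```python
import sys

def run_gate() -> list[str]:
    """Hypothetical: load artifacts, apply the policy rules, and return a
    message for every blocking rule that failed."""
    return []

def main() -> int:
    try:
        failures = run_gate()
    except Exception as exc:  # fail safe: broken inputs are never a green build
        print(f"gate error: {exc}", file=sys.stderr)
        return 2
    for message in failures:
        print(f"blocking failure: {message}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```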
