Three Ways an LLM on Your Warehouse Gets ‘Why Did Revenue Drop?’ Wrong-and How to Fix Each

digitado ⋅ 31 de May de 2026

Connecting a language model to your warehouse is a great demo and a bad diagnostic engine. Here are the three failures you’ll actually hit, each with a fix, in plain SQL.

Plug an LLM into your warehouse, type “why did revenue drop 8% last week?”, and you get a confident paragraph in forty seconds. It feels like the analyst bottleneck just disappeared.

Then you start checking the answers, and three problems show up-in order, each quieter than the last. None of them throws an error. Here’s each one on a real Postgres example, and what fixes it.

Error 1 — The SQL runs, but it’s not the number you mean

The question: “What was our revenue last month?”

What the model writes:

SELECT SUM(amount) AS revenue
FROM orders
WHERE created_at >= date_trunc('month', now()) - interval '1 month'
  AND created_at <  date_trunc('month', now());

Valid SQL. Runs clean. Returns a number. And it’s wrong, because orders contains things “revenue” doesn’t:

cancelled and refunded orders are still in there
test orders (is_test = true) are still in there
amount mixes currencies — AED and EGP summed into one meaningless total
created_at counts orders that were placed last month, not delivered — revenue you may never collect

The model didn’t know any of this, because the definition of “revenue” for your business doesn’t live in the schema. It lives in a dbt model nobody documented, or in the head of the analyst who left. So the model guessed, and the guess executed cleanly.

That’s the dangerous part. A wrong number that errors out, you catch. A wrong number that returns, you ship to your CEO.

The fix — pin the definition once, never let the model touch raw tables.

Define the metric the business actually means, as a view (or a dbt model), and point the model at that:

CREATE VIEW finance.net_revenue_orders AS   -- one row per delivered order
SELECT
  o.order_id,
  date_trunc('day', o.delivered_at)  AS day,            -- delivered, not placed
  o.city,
  o.gmv_usd                          AS net_revenue_usd -- currency-normalized upstream
FROM orders o
WHERE o.status = 'delivered'      -- not placed, not cancelled
  AND o.is_test = false;          -- no test traffic

One row per delivered order — this single view is the only source for every query below. Now “what was revenue last month” becomes SELECT SUM(net_revenue_usd) FROM finance.net_revenue_orders WHERE day >= … — exactly one definition of revenue, the one finance agreed on. This is what a semantic layer is for: the model translates the question, but the math runs on definitions you control. Failure stops looking like a plausible wrong number and starts looking like an error if a column doesn’t exist — which is the kind of failure you want.

The same trap, one level deeper: the fan-out join. Join orders to order_items (one order, many items) and each order gets counted once per item — so SUM(net_revenue_usd) adds it two or three times over. No error, just an inflated total. Pinning the view fixes this too.

Error 2 — Ask why, get a story instead of a diagnosis

The question: “Why did revenue drop 8% last week?”

What the model gives you: a paragraph. “Revenue is down because conversion fell on mobile, likely driven by the checkout change on Tuesday, with a secondary effect from lower paid traffic.”

It sounds like an analyst. It is not a diagnosis. The model didn’t decompose anything or test anything — it generated the most plausible-sounding explanation given the words in your question. Sometimes that story is right. Often it points at the wrong cause, and you can’t tell which time you’re in.

A real diagnosis is a search, not a story. And the search is just SQL.

Fix, step 1 — decompose the metric and find which factor moved.

Revenue ≈ orders × average order value. Before guessing why, find out what:

SELECT
  date_trunc('week', day)          AS week,
  COUNT(*)                         AS orders,
  SUM(net_revenue_usd)             AS revenue,
  SUM(net_revenue_usd) / COUNT(*)  AS aov
FROM finance.net_revenue_orders
WHERE day >= date '2026-05-11' AND day < date '2026-05-25'
GROUP BY 1
ORDER BY 1;

Orders barely moved. The whole drop is in basket size. You haven’t found the cause yet, but you’ve cut the problem in half — this is not a traffic or conversion problem, so the model’s “mobile conversion” story was wrong on step one.

Fix, step 2 — separate rate from mix (the trap that catches everyone).

Your AOV can fall while every city’s AOV holds steady — if your order mix shifted toward cheaper markets. Same paradox that makes an aggregate move in the opposite direction of all its parts.

SELECT
  city,
  COUNT(*) FILTER (
    WHERE day >= date '2026-05-11' AND day < date '2026-05-18') AS orders_wk1,
  COUNT(*) FILTER (
    WHERE day >= date '2026-05-18' AND day < date '2026-05-25') AS orders_wk2,
  AVG(net_revenue_usd) FILTER (
    WHERE day >= date '2026-05-11' AND day < date '2026-05-18') AS aov_wk1,
  AVG(net_revenue_usd) FILTER (
    WHERE day >= date '2026-05-18' AND day < date '2026-05-25') AS aov_wk2
FROM finance.net_revenue_orders
WHERE day >= date '2026-05-11' AND day < date '2026-05-25'
GROUP BY city
ORDER BY orders_wk2 DESC;

If per-city AOV is flat but a low-AOV city’s share of orders jumped, your “AOV dropped” is really “the mix shifted” — and the fix is a marketing/supply question, not a pricing one. The two diagnoses lead to opposite decisions. The decomposition tells them apart; the story doesn’t.

Fix, step 3 — only now ask why, and test it. Once you’ve localized the drop to a factor and a segment, you have a hypothesis (“a promo in the cheap-AOV city pulled mix down”). That you confirm with the data or an experiment — not with a paragraph.

This is the whole difference: the model free-associates one cause; the decomposition enumerates and isolates them. Three queries, fully reproducible, and you actually know.

Error 3 — Ask the same question twice, get two different answers

The test: run “why did revenue drop last week?” on Monday. Run it again Wednesday. Compare.

You can get two different numbers from the same question on the same data — even with the metric pinned. The cause is the model itself. Even at temperature zero, large models don’t reliably return identical output: floating-point addition isn’t associative, so different batch sizes and reduction orders across requests produce slightly different roundings, and providers document only best-effort determinism. The SQL it generates can shift run to run — a different join, a different filter — and out comes a different number.

(If the metric isn’t pinned, it gets worse — Monday’s run sums one definition, Wednesday’s another — but that’s Error 1 again. The failure unique to this one is that even a perfect, pinned query still sits on top of a non-deterministic model.)

For a creative task, who cares. For a number that drives a budget decision, this is disqualifying — you don’t have a measurement, you have one draw from a distribution. And the first time a stakeholder catches two different numbers for one question, the tool is dead, because nobody can tell which one was right.

The fix — make the metric deterministic code, and let the LLM only narrate it.

The number must come from a fixed, versioned definition — the view from Error 1, or a dbt model under version control — so it’s byte-identical every run, by anyone:

-- same query, same data => same answer, every time, forever
SELECT SUM(net_revenue_usd)
FROM finance.net_revenue_orders
WHERE day >= date '2026-05-11'
  AND day <  date '2026-05-18';

The LLM never computes the number. It picks which pinned metric and window you meant, a deterministic layer runs the math, and the model writes the sentence around the result. Run it a hundred times — same number, same decomposition. The narrative can vary; the measurement can’t.

So where does the LLM actually belong?

Not in the engine. On top of it.

The model is genuinely good as the interface and the narrator — wired over a deterministic core, not in place of one:

Translate the question, don’t invent the answer. Natural language in → “you want net revenue, weekly, by city” out. The math runs on your pinned definitions.
Suggest hypotheses, don’t confirm them. “Here are eight things that could move this metric” is a great use of a model that’s read everything. Which one is true is the decomposition’s job, and the experiment’s.
Narrate the verified result. Once the SQL above produced a real, reproducible answer, having the model turn it into a clear paragraph for a stakeholder is genuinely useful. The narrative was never the problem — the narrative standing in for the analysis was.

Two questions cut through any “AI analyst” demo:

Show me the same question, three times, on three days. If the numbers move, it’s a generator, not a measurement.
When it’s wrong, how do I find out? If the answer is “the SQL would error,” good. If it’s a shrug, you’ve found the plausible-wrong-number machine.

The methods that pass that bar aren’t new or glamorous: pin the metric, decompose it, separate rate from mix, then test the cause. An LLM makes all four faster and friendlier. It just can’t replace them — and the moment you ask it to, it stops measuring and starts storytelling.

Three Ways an LLM on Your Warehouse Gets ‘Why Did Revenue Drop?’ Wrong-and How to Fix Each was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Like 0

Liked Liked