Behind the Curtain: Why the Most Successful AI Apps Are Actually Code-First
We didn’t start with a big strategy. We just wanted to move faster.
We had APIs, Swagger specs, and a lot of repetitive work: validation, mock data, test payloads. So we thought, “Why not let the LLM handle it?” and started a POC. It sounded right. And honestly, in the beginning, it worked. We gave it the spec and asked the LLM to generate payloads, validate inputs, even simulate flows. The output looked clean. Demos went smoothly. Everyone was impressed.
Then we tried to use it in a real workflow. That’s where things started getting messy. The system didn’t crash. Which would have been easier. Instead, things failed in small ways. One request would pass, another would fail. Same structure, same API, but slightly different values. Logs didn’t show anything obvious. No clear error pattern. Just inconsistency.
One example stayed with me. We were generating mock data for a dispute API. The LLM would read the OpenAPI spec and produce request payloads. At first glance, everything looked fine. But when we actually sent those requests:

- Some fields didn’t match exact formats.
- Enum values were “close,” but not correct.
- A required field would randomly go missing.
- IDs that were supposed to match across calls didn’t line up.

Nothing was completely wrong. But nothing was reliable either.
We tried fixing it the obvious way. Better prompts. More instructions. More examples. Stricter formatting rules. It improved things a bit. But not enough. Because the problem wasn’t the prompt. The problem was that we were expecting a probabilistic system to behave like a deterministic one. And that doesn’t work.
That’s when we changed the approach. We stopped asking the LLM to generate final outputs. Instead, we let code take control. We moved to a simple idea: let code do what code is good at, and let the LLM do only what it’s good at.

For the same mock data problem, the flow became very different. First, we parsed the OpenAPI spec in code and extracted required fields, types, enums, everything.
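In Python, that extraction step can be as small as this. A minimal sketch, assuming a YAML spec with inline (non-$ref) schemas; the function name and arguments are illustrative, not our exact code:

```python
import yaml  # PyYAML

def extract_request_schema(spec_path: str, path: str, method: str) -> dict:
    """Pull the request-body schema for one endpoint out of an
    OpenAPI 3 spec: required fields, types, enums, everything."""
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    operation = spec["paths"][path][method]
    # Assumes an inline JSON schema for simplicity (no $ref resolution).
    return operation["requestBody"]["content"]["application/json"]["schema"]
```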
Then we mapped fields using simple logic:

- If it’s a name → use faker.
- If it’s an email → generate a proper email.
- If it’s an enum → pick from exact values, no guessing.
- If it’s an amount → keep it within a defined range.
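Here’s roughly what that mapping looks like as code. A sketch of the idea, not our exact rules, assuming the faker package and the schema dict from the extraction step above:

```python
import random
from faker import Faker

fake = Faker()

def generate_value(field_name: str, field_schema: dict):
    """Code-first mapping: every shape we can classify is handled
    deterministically; anything else is returned as a gap."""
    if "enum" in field_schema:
        # Pick from the exact allowed values, never guess.
        return random.choice(field_schema["enum"])
    if field_schema.get("format") == "email" or "email" in field_name:
        return fake.email()
    if "name" in field_name:
        return fake.name()
    if "amount" in field_name:
        # Keep amounts inside the defined (or a sensible default) range.
        low = field_schema.get("minimum", 1)
        high = field_schema.get("maximum", 10_000)
        return round(random.uniform(low, high), 2)
    return None  # a gap: the caller routes this field to the LLM

def build_payload(properties: dict) -> dict:
    """Build a mock payload field by field from the schema's properties."""
    return {name: generate_value(name, s) for name, s in properties.items()}
```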
No surprises. Only when something didn’t match, like a weird custom field, did we ask the LLM to help. The LLM doesn’t decide everything; it just fills gaps. Then we added strict validation before anything moved forward. If something didn’t match the schema, it was rejected immediately. No silent failures.
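The validation gate itself is ordinary schema checking. A sketch using the jsonschema package; any validator that fails loudly would do:

```python
from jsonschema import Draft202012Validator

def validate_or_reject(payload: dict, schema: dict) -> dict:
    """The strict gate: anything off-schema is rejected immediately
    and loudly, instead of failing somewhere downstream."""
    errors = list(Draft202012Validator(schema).iter_errors(payload))
    if errors:
        raise ValueError(
            "Payload rejected: " + "; ".join(e.message for e in errors)
        )
    return payload
```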
The difference was obvious. Mock data became consistent. Failures became predictable. Debugging became simple again. We were no longer guessing what the system would do.
The biggest change wasn’t technical; it was the approach. We stopped treating the LLM like the system. It’s not. We used it as just a component. A useful one, but not something you hand full control to.
LLMs are good at understanding messy input. They are not good at enforcing rules.
Systems need rules. That’s where code comes in. So now the way we think about it is simple. If something needs to be correct every time, it goes into code. If something needs interpretation or flexibility, we let the LLM handle it. That’s it.
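Put together, the split might look like this. A sketch that wires up the earlier pieces; `ask_llm_for_value` is a hypothetical stand-in for whatever model call fills the gaps:

```python
def generate_mock_request(spec_path: str, path: str, method: str) -> dict:
    """Code owns the pipeline end to end; the LLM only fills gaps."""
    schema = extract_request_schema(spec_path, path, method)
    payload = build_payload(schema["properties"])
    # Only the fields code couldn't classify go to the LLM.
    for field, value in payload.items():
        if value is None:
            # ask_llm_for_value: a stand-in for your actual model call.
            payload[field] = ask_llm_for_value(field, schema["properties"][field])
    # Code gets the final say: reject anything that drifted off-schema.
    return validate_or_reject(payload, schema)
```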
LLM-first looked great in demos. Code-first worked in production. That’s the difference that mattered for us.