Deterministic Shells, Probabilistic Cores: The Architecture Pattern Behind Every Reliable Agent
The secret to reliable AI agents isn’t better prompts. It’s wrapping a probabilistic core inside a deterministic shell — and knowing exactly where the boundary sits.
The Pattern Nobody Named
Every architecture decision I’ve made over the past year — skill servers, model routing, meta-tools, context engineering, the 9-tool framework — follows the same underlying principle. I never named it explicitly, but it’s been the thread connecting everything.
Here it is: the LLM is the core, not the system. The system is deterministic. The core is probabilistic. And the boundary between them is the most important design decision in your entire agent.
An LLM is, fundamentally, a next-token predictor. It’s probabilistic by nature. Same input, different runs, different outputs. It hallucinates. It picks wrong tools. It goes on tangents. It forgets constraints mid-conversation. Anyone who has shipped an agent knows this — the model is smart, but it’s unreliable in ways that are impossible to fully predict.
The instinct of most developers is to fix this inside the model. Better prompts. More examples. Temperature tweaks. Retry logic.
That instinct is wrong.
The fix is outside the model. You wrap the unpredictable core in a predictable shell — deterministic code that controls what goes in, what comes out, and what the model is allowed to do at each step. The model reasons. The shell constrains.
This is the pattern behind every agent I’ve built that survived production. Let me walk through each layer.
Layer 1: Deterministic Input Assembly
The first job of the shell is controlling what the model sees.
A context window is not a suggestion box. It’s an engineered space where every token should earn its place. I wrote a full article on this (Context Engineering: What I Changed in My Agents), but the key insight for the deterministic shell pattern is this: what enters the context window should be assembled by deterministic code, not by the model’s whims.
Dynamic system prompts. The system prompt isn’t a static text file. It’s code that assembles the right instructions based on current state — time of day, user tier, loaded skills, active constraints. The model doesn’t decide what rules it follows. The shell decides.
```python
def build_system_prompt(user, agent_state, current_time):
    sections = [
        load_base_identity(),              # Always present
        build_time_context(current_time),  # "It's Saturday 2am"
        build_user_context(user),          # Tier, preferences, history
        build_active_skills(agent_state),  # Currently loaded capabilities
        build_constraints(user.tier),      # What NOT to do
    ]
    return "\n\n".join(sections)
```
This is deterministic. Same user + same state + same time = same system prompt. Every time. The model receives a consistent, controlled briefing — not a generic prompt that hopes the model figures out the context.
On-demand skill loading. In the 9-tool framework, the agent starts with a minimal toolkit. Skills — with their rich tool descriptions, workflow guidelines, and operational context — are loaded only when needed. The decision to load a skill involves the model (it calls load_skill), but what gets loaded is deterministic. The SKILL.md content is fixed. The tool descriptions are fixed. The workflow guidelines are fixed. And the skills offered to the agent are chosen deliberately: give it only what it needs.
The model operates inside a context that was assembled by code. Not by itself.
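A minimal sketch of that loading step, assuming a file layout of `skills/<name>/SKILL.md` — the directory structure, function signature, and skill names here are illustrative, not the framework's actual implementation:

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # hypothetical layout: skills/<name>/SKILL.md

def load_skill(name: str, agent_state: dict, skills_dir: Path = SKILLS_DIR) -> str:
    """Meta-tool body: the model *requests* a skill; the shell serves fixed content."""
    skill_file = skills_dir / name / "SKILL.md"
    if not skill_file.is_file():
        return f"Unknown skill: {name}"  # anything off-menu is rejected
    # Deterministic: the same skill name always yields the same instructions.
    agent_state.setdefault("loaded_skills", []).append(name)
    return skill_file.read_text()
```

The model can only choose *which* fixed briefing to receive — it can never author the briefing itself.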
Conversation context management. History doesn’t just accumulate — it’s managed. Recent messages stay in full. Older messages get summarized. Large tool outputs get replaced with compact traces. This is deterministic housekeeping that prevents the model from drowning in its own history.
```python
def manage_context(messages, max_recent=8):
    if len(messages) <= max_recent:
        return messages
    old = messages[:-max_recent]
    summary = summarize_messages(old)  # Deterministic compression
    recent = messages[-max_recent:]
    for msg in recent:
        if len(msg.content) > 2000:
            msg.content = truncate_with_summary(msg.content)
    return [{"role": "system", "content": f"Previous context: {summary}"}] + recent
```
The model doesn’t manage its own memory. The shell does.
Layer 2: Deterministic Routing
The second job of the shell is deciding which model runs and what tools are available — before the model ever sees the query.
Model routing. I wrote a full article on this (Your Users Will Never Pick the Right Model). The router sits between the user’s query and the LLM call. It analyzes the query and routes to the appropriate model — Haiku for price checks, Sonnet for general work, Opus for deep analysis.
```python
async def route(query: str, context: dict) -> str:
    # Layer 1: Hard rules (instant, deterministic)
    if is_price_check(query):
        return "claude-haiku"
    if is_greeting(query):
        return "claude-haiku"
    # Layer 2: Skill-based routing
    skill = context.get("active_skill")
    if skill and skill.get("preferred_model"):
        return skill["preferred_model"]
    # Layer 3: Context signals
    if context.get("conversation_turns", 0) > 15:
        return "claude-sonnet"
    # Layer 4: LLM classification (only for ambiguous cases)
    return await classify_complexity(query)
```
The critical insight: 80% of routing is deterministic rules. Only 20% — the genuinely ambiguous cases — uses an LLM classifier. And even that classifier is constrained to return one of four categories. The model doesn’t choose its own model. The shell does.
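Even that last probabilistic step is fenced. A sketch of how the shell might clamp the classifier's answer — the four category names and the model mapping are assumptions for illustration, since the article only states that the classifier is constrained to four categories:

```python
# Hypothetical category names and model mapping.
CATEGORIES = {
    "trivial": "claude-haiku",
    "standard": "claude-sonnet",
    "complex": "claude-opus",
    "ambiguous": "claude-sonnet",  # deterministic fallback
}

def constrain_classification(raw_label: str) -> str:
    """Anything outside the four allowed categories collapses to a
    safe default chosen by the shell, not by the model."""
    label = raw_label.strip().lower()
    return CATEGORIES.get(label, CATEGORIES["ambiguous"])
```

The classifier can be wrong, verbose, or off-script — the worst case is still a sensible default model.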
Skill-based tool scoping. When a skill is loaded, it defines exactly which tools are available and exactly how they should be used. The model doesn’t get to browse a catalog of 40 tools and pick. It gets 9 core tools + whatever the current skill provides — with rich, unambiguous descriptions.
This is deterministic scoping: the model operates within a predefined capability boundary. It can reason freely within that boundary, but it can’t escape it.
```yaml
skill: financial_analysis
preferred_model: claude-opus
max_latency_ms: 30000
tools:
  - get_stock_snapshot
  - get_sentiment
  - generate_report
constraints:
  - "Always include risk disclaimers"
  - "Never recommend specific trade actions"
```
The skill defines the box. The model thinks inside it.
Layer 3: The Probabilistic Core (Let It Cook)
Here’s where some developers get the pattern wrong. They read “deterministic shell” and think “deterministic everything.” They try to script every step, every tool call, every response format.
That kills the agent.
The whole point of using an LLM is that it reasons. It synthesizes. It handles novel situations. It makes connections you didn’t anticipate. That’s the probabilistic core, and it’s the valuable part.
The pattern isn’t “eliminate non-determinism.” It’s “contain non-determinism in the right place.”
Inside the shell, the model should be free to:
- Decide which tools to call (from the scoped set available)
- Decide in what order (within the skill’s workflow guidelines)
- Synthesize results in natural language
- Adapt its tone and depth to the conversation
- Handle unexpected user queries with common sense
The model is the reasoning engine. It’s good at reasoning. Let it reason.
What it should NOT be free to do:
- Choose which model runs (that’s the router)
- Decide what context to load (that’s the system prompt builder)
- Manage its own memory (that’s the context manager)
- Access tools outside its current scope (that’s the skill system)
- Skip confirmation for irreversible actions (that’s the guardrail layer)
The boundary is clear: the shell controls the environment. The model controls the reasoning within that environment.

Layer 4: Deterministic Output Processing
The model produced a response. Now the shell takes over again.
Tool call validation. When the model decides to call a tool, the shell validates the call before execution. Does the tool exist in the current scope? Are the parameters valid? Is the call allowed by the current skill’s constraints?
```python
def validate_tool_call(call, active_skill, user):
    # Tool must exist in current scope
    if call.tool_name not in get_available_tools(active_skill):
        return False, f"Tool '{call.tool_name}' not in current scope."
    # Confirmation required for high-impact tools
    if call.tool_name in active_skill.get("confirmation_required", []):
        return "needs_confirmation", call
    # Parameter validation
    if not validate_params(call.tool_name, call.arguments):
        return False, "Invalid parameters."
    return True, call
```
The model wanted to call a tool. The shell decides if it’s allowed.
Output sanitization. Tool results get cleaned before re-entering the context. Prompt injection attempts in tool outputs get filtered. Large responses get truncated. This is the security layer I described in my agent security article — but it’s also an architectural choice. The shell controls what the model sees at every step.
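A minimal sketch of that sanitization pass — the injection patterns and the truncation limit below are illustrative stand-ins; real filtering is considerably more involved:

```python
import re

# Illustrative patterns only — a toy subset of what real filtering covers.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]
MAX_TOOL_OUTPUT = 4000  # assumed cap, in characters

def sanitize_tool_output(raw: str) -> str:
    """Clean a tool result before it re-enters the context window."""
    text = raw
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub("[filtered]", text)
    if len(text) > MAX_TOOL_OUTPUT:
        text = text[:MAX_TOOL_OUTPUT] + "\n[truncated by shell]"
    return text
```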
Response formatting. The final response might need structural processing — injecting disclaimers, formatting data, enforcing length constraints. The model generates natural language. The shell applies the template.
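Tying this back to the skill definition above: a skill's declared constraints can drive the formatting step deterministically. A sketch — the disclaimer wording and the constraint-matching logic are assumptions, not the article's implementation:

```python
def apply_response_template(text: str, active_skill: dict) -> str:
    """Shell-side formatting: the model wrote the prose, the shell
    enforces structure. Disclaimer wording here is illustrative."""
    constraints = active_skill.get("constraints", [])
    if any("risk disclaimer" in c.lower() for c in constraints):
        if "not financial advice" not in text.lower():
            text += "\n\n_This is not financial advice._"
    return text
```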
Post-run learning extraction. After the run completes, deterministic code extracts operational learnings. “This tool timed out.” “The user said the response was too verbose.” “The iteration budget was nearly exhausted.” These become agent memory — stored by the shell, injected by the shell in future runs, never managed by the model itself.
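Those three example learnings can be extracted with plain conditionals over the run trace — no LLM involved. A sketch, where the field names on the `run` record are assumptions for illustration:

```python
def extract_learnings(run: dict) -> list[str]:
    """Deterministic post-run checks. Field names on `run` are assumed."""
    learnings = []
    for call in run.get("tool_calls", []):
        if call.get("timed_out"):
            learnings.append(f"Tool '{call['name']}' timed out.")
    used = run.get("iterations", 0)
    budget = run.get("iteration_budget", 1)
    if used / budget > 0.9:
        learnings.append("Iteration budget was nearly exhausted.")
    if "too verbose" in run.get("user_feedback", "").lower():
        learnings.append("User said the response was too verbose.")
    return learnings
```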
The Design Principle
Let me be clear about something: the probabilistic core is the magic. It’s the whole reason we’re building agents instead of writing scripts.
An LLM can synthesize five conflicting data sources into a coherent analysis. It can handle a user query it’s never seen before and figure out the right approach. It can make connections between a portfolio risk signal and a calendar scheduling decision that no rule engine would ever anticipate. The reasoning capability of these models is genuinely brilliant, and it gets better every few months.
The point of the deterministic shell is not to limit that brilliance. It’s to give it the right stage.
Think about it this way: the question is never if your agent will do something unexpected. It’s when. That’s what probabilistic means — the model will surprise you. Sometimes brilliantly. Sometimes catastrophically. And you can’t tell which one it’ll be until it happens.
The shell is what makes the brilliant surprises possible and the catastrophic ones survivable. When you control the inputs — what skills are loaded, what tools are available, what context the model sees — you’re not constraining the LLM. You’re defining the arena where its reasoning can shine. When you validate the outputs — tool call authorization, response sanitization, human-in-the-loop for irreversible actions — you’re not clipping its wings. You’re making sure a single bad decision doesn’t burn the house down.
Inside the right framework, let the model live. Let it reason freely. Let it pick unexpected tool combinations. Let it synthesize in ways you didn’t anticipate. That’s the whole point. That’s where the value is.
But you define that framework. You choose which skills are loaded. You decide which tools are in scope. You control what enters and exits the context window — including the intermediate results between tool calls. You set the guardrails for irreversible actions. You build the routing, the validation, the sanitization.
The engineering isn’t in making the model better. It’s in knowing exactly where to place the frame, and building it well enough that the model can do its best work inside it.
The agents that work in production aren’t the ones that restrict their LLM the most. They’re the ones where the framework is so well-defined that the LLM’s probabilistic brilliance becomes a feature, not a risk.
Deterministic shells. Probabilistic cores. The shell defines the arena. The core performs in it.
That’s the pattern.
Thanks for reading! I’m Elliott, a Python & Agentic AI consultant and entrepreneur. I write weekly about the agents I build, the architecture decisions behind them, and the patterns that actually work in production.
If this pattern clarified how you think about agent architecture, I’d appreciate a few claps 👏 and a follow. And if you’ve found a different way to tame the probabilistic beast — I’d love to hear about it in the comments.
Deterministic Shells, Probabilistic Cores: The Architecture Pattern Behind Every Reliable Agent was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.