Agent Harnessing: The Non-Model Infrastructure That Makes AI Agents Actually Work

1.0 Introduction

The model is not the hard part.

Every frontier model has a story about its latest capabilities. Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro are genuinely remarkable reasoning engines, each pushing further on context, multimodal perception, and agentic tool use. Behind each of these extraordinary reasoning engines, however, is a non-model infrastructure layer working silently to make the model useful in practice.

Everyone building AI agents has access to the same frontier models through the same APIs. Often, what separates a production-grade agent from one that fails silently on the third real workload is not the model. It is the infrastructure surrounding it.

That infrastructure is the harness.

Agent harnessing is the non-model layer that gives an agent persistent memory, tool connectivity, controlled reasoning loops, verification, and the ability to coordinate with other agents. The model decides what to do. The harness determines whether it actually does it, reliably, at scale, across the full range of things that go wrong.

This article covers what that infrastructure contains and offers a brief introduction to implementing it.

1.1 The Formula That Needed Updating


A popular way to think about agent architecture has been:

Agent = LLM + Memory + Planning + Tool Use

That still works for simple deployments. But a more complete picture comes from Agentic Artificial Intelligence (Ehtesham et al., 2026), which proposes a unified taxonomy breaking agents into six dimensions:

Agent = Perception + Brain + Planning + Action + Tool Use + Collaboration

I find this a significantly better starting point than the older formula. Each dimension has its own failure modes, its own design decisions, and its own place in the harness. The older formula bundles too much together and hides exactly the problems that bite hardest in production.

That said, the taxonomy I use throughout this article is a personal adaptation of the original. I elevate Memory as a distinct, standalone dimension rather than folding it into the Brain, which is where the paper positions it. That choice reflects a practical conviction I have built from working with production systems: memory is its own engineering discipline with its own tooling, its own failure modes, and an implementation surface that is genuinely independent from the reasoning engine. Collapsing it into Brain understates how much work it actually demands. My adapted framing becomes:

Agent = Perception + Brain + Memory + Planning + Action + Collaboration

Perception is how the agent receives and preprocesses its inputs. Claude Opus 4.6 and GPT-5.4 both handle text, images, audio, and structured data natively, so this is no longer just a text problem. What gets passed into the context window, in what form, and at what level of compression is a harness decision, not a model decision.

Brain is the reasoning engine. Agents today are not single-model systems. Claude Haiku 4.5 handles fast, low-complexity work like extraction and classification. Claude Sonnet 4.6 handles the bulk of reasoning and coding. Claude Opus 4.6 or GPT-5.4 steps in for deep orchestration and high-stakes decisions. The harness decides which model handles which task.

Memory has grown into its own engineering discipline. More on this in the next section.

Planning now comes in two flavours. The ReAct loop reasons and acts at every step, which is flexible but token-heavy. Plan-and-execute separates the thinking from the doing, breaking the task into a dependency graph upfront and executing steps in parallel where possible. The harness decides which pattern fits the task.

Action has moved on from individual JSON function calls. The dominant pattern today is code-as-action: the agent writes a script that calls multiple tools, handles retries in code, processes the results, and returns a single clean output. This keeps the context window lean and cuts costs significantly on tool-heavy pipelines.

Collaboration is now handled at the protocol level through MCP and A2A rather than baked into individual frameworks. More on this in the multi-agent section.

The reason I find this six-part framing more useful than the older four-part formula is straightforward: it forces one to reason about failure at the right granularity. When an agent goes wrong in production, knowing which of the six dimensions failed tells you exactly where in the harness to look. The older formula collapses that precision into ambiguity, and ambiguity is expensive when something breaks at two in the morning.

1.2 What the Harness Is Actually Made Of

The agent harness has six distinct components. Each has its own failure modes and its own implementation surface.

Context management controls what the model reasons over. Most agent failures I have seen are not simply the model giving a wrong answer; they are the model reasoning over the wrong information because the harness injected too much, too little, or the wrong kind of context. The harness is responsible for selecting what to retrieve, compressing what exceeds the context window, and filtering out noise before it reaches the model.

Memory gives the agent continuity across steps and sessions. Without it, every session starts from scratch.

Tools give the agent the ability to act in the world rather than just produce text.

Control flow governs the reasoning loop: when to continue, when to stop, and what to do when something goes wrong.

Verification independently checks whether the agent’s output actually meets the required standard before the task is closed.

Coordination manages how multiple agents communicate and divide work.

None of these are model capabilities. All of them are harness responsibilities.

2.1 Implementing Context Management: The Harness Starts Before Reasoning Begins

Context management is upstream of everything. What enters the context window determines what the model can reason about.

The four operations the harness should implement are:

Select: retrieve only what is relevant to the current step. In a RAG pipeline, this means running a semantic search over the knowledge base and injecting the top-ranked chunks, not the entire document store. For instance, a legal review agent searching a 500-document repository should receive the 6 most relevant contract clauses, not all 500 documents.

Compress: when context is long, summarise it rather than truncate it blindly. Anthropic’s Agent SDK provides automatic compaction, but for tasks spanning multiple sessions, the harness should also write a structured progress file at the end of each session. This file records what was completed, what is pending, and what the next session should prioritise. Compaction alone does not preserve this kind of structured state.

Isolate: filter inputs at the tool boundary before they reach the model. A database query returning 10,000 rows should be aggregated to a summary before injection. This is both a cost control and a prompt injection defence. When an agent reads web pages or user-submitted documents, adversarial instructions in that content can redirect the agent’s behaviour unless the harness sanitises inputs first.

Write: externalise information the agent will need later. Scratchpads, progress files, and structured state objects should be written by the harness at each checkpoint, not left to the model to reconstruct from memory.
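A minimal sketch of the Select and Isolate operations is below. The retriever client, its search method, and the row structure are illustrative placeholders rather than any specific library's API.

```python
# Sketch of the Select and Isolate operations. The retriever object, its
# search method, and the row structure are illustrative placeholders.

def select_context(retriever, query: str, k: int = 6) -> list[str]:
    """Select: inject only the top-k relevant chunks, never the whole store."""
    results = retriever.search(query, top_k=k)   # hypothetical retriever client
    return [r.text for r in results]

def isolate_tool_output(rows: list[dict], limit: int = 20) -> str:
    """Isolate: aggregate a large tool result before it reaches the model."""
    if len(rows) <= limit:
        return "\n".join(str(r) for r in rows)
    # Summarise rather than inject thousands of raw rows into the context window.
    preview = "\n".join(str(r) for r in rows[:limit])
    return f"{len(rows)} rows returned; showing the first {limit}:\n{preview}"
```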

2.2 Implementing Memory

Memory should be implemented as a layered system, not a single database. Each layer handles a different scope.

Working memory is the active context window. It is managed by the context engineering operations above.

External semantic memory uses a vector database to store facts extracted from past interactions and retrieve them when relevant.

For example, Mem0 is an open-source library that acts as a dedicated memory layer sitting between an agent and its underlying language model. Mem0 converts conversations into atomic memory facts, stores them in a vector store (Qdrant, pgvector, or Chroma), and retrieves them via similarity search at the start of each reasoning cycle.

The implementation involves three steps. First, instantiate a Memory object, which by default uses an in-memory store suitable for development. Second, call memory.add(messages, user_id=user_id) after each interaction; Mem0 extracts the facts and writes them to the configured store. Third, call memory.search(query=message, filters={"user_id": user_id}, top_k=5) at the start of each new session to retrieve the most relevant stored facts. The retrieved facts are then injected into the system prompt before the model call.
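A minimal sketch of those three steps follows. It assumes the default Mem0 configuration (fact extraction calls an LLM under the hood, so the configured provider's API key must be available), and exact parameter names such as top_k versus limit vary between Mem0 versions, so treat this as an outline rather than a definitive integration.

```python
from mem0 import Memory

memory = Memory()  # default config; fact extraction calls an LLM, so the
                   # configured provider's API key must be available

def remember(messages: list[dict], user_id: str) -> None:
    # Mem0 extracts atomic facts from the conversation and writes them
    # to the configured vector store.
    memory.add(messages, user_id=user_id)

def recall(query: str, user_id: str, k: int = 5) -> list[str]:
    # Retrieve the most relevant stored facts at the start of a session.
    # Depending on the Mem0 version, this parameter may be named limit or top_k.
    results = memory.search(query=query, user_id=user_id, limit=k)
    items = results["results"] if isinstance(results, dict) else results
    return [item["memory"] for item in items]

# The retrieved facts are then prepended to the system prompt before the model call.
```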


Persistent cross-session memory involves storing information as files in a managed directory and reading and writing those files across entirely separate conversations. The agent reads, writes, and updates these files through tool calls. This is the right layer for tasks that run over multiple days, where the agent needs to maintain a coherent project model across sessions.
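A minimal sketch of file-backed persistent memory exposed to the agent as two tool functions; the directory name and file layout are illustrative assumptions.

```python
from pathlib import Path

MEMORY_DIR = Path("./agent_memory")  # illustrative managed directory

def read_memory_file(name: str) -> str:
    """Tool: read a named memory file, e.g. 'project_state.md'."""
    path = MEMORY_DIR / name
    return path.read_text() if path.exists() else ""

def write_memory_file(name: str, content: str) -> str:
    """Tool: create or update a named memory file that persists across sessions."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    (MEMORY_DIR / name).write_text(content)
    return f"wrote {name}"
```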

2.3 Implementing the Control Loop

The agent control loop runs five stages in sequence: Perceive, Reason, Plan, Act, Observe. The model handles the reasoning. The harness governs the loop itself.

Perceive → Reason → Plan → Act → Observe → [repeat]

The most important harness decision here is termination. A loop without stopping conditions runs until it exhausts your budget. The harness should layer four stopping conditions (a minimal sketch of the loop follows the list):

  • Maximum turns: set an explicit turn cap matched to the task. A conversational agent handling a single user request typically finishes in one to three turns. A coding agent working through a complex multi-file refactor may need twenty or more. The right cap is specific to the task, not a universal number: low enough to catch runaway loops, high enough not to interrupt legitimate work.
  • Token budget: set a per-run token limit. Single-agent loops consume roughly four times the tokens of equivalent chat interactions. Multi-agent systems push that to fifteen times or more.
  • No-progress detection: if the last two or three iterations produced no new information or no change in state, exit the loop and escalate.
  • Goal-achievement check: verify at each iteration whether the task objective has actually been met, not just whether the agent believes it has.
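Here is the sketch referenced above: a loop that layers all four conditions. The call_model, execute_action, goal_met, and escalate callables are placeholders for the harness's own implementations.

```python
def run_agent(task, call_model, execute_action, goal_met, escalate,
              max_turns=20, token_budget=200_000):
    """Placeholders: call_model returns (action, tokens_used), execute_action
    returns a string observation, goal_met and escalate are harness-defined."""
    history, tokens_used = [], 0

    for _ in range(max_turns):                          # 1. maximum turns
        action, tokens = call_model(task, history)      # Reason + Plan
        tokens_used += tokens
        if tokens_used > token_budget:                  # 2. token budget
            return escalate("token budget exhausted", history)

        observation = execute_action(action)            # Act
        history.append(observation)                     # Observe

        # 3. no-progress detection: crude check that the last three string
        # observations are identical; real harnesses diff structured state.
        if len(history) >= 3 and len(set(history[-3:])) == 1:
            return escalate("no progress in last three turns", history)

        if goal_met(task, history):                     # 4. goal-achievement check
            return history

    return escalate("turn cap reached", history)
```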

2.4 Implementing Tools

The core value of MCP is simple: implement a tool once as an MCP server and every compatible agent can call it without additional integration work. What matters more for harness design is how MCP changes the way tool interactions are structured inside the loop.

The most useful pattern for tool-heavy agents is what Anthropic calls code-as-action. Rather than routing individual MCP tool calls sequentially through the agent loop, which floods the context window with intermediate results, the agent writes a short script that calls multiple tools, joins the results, handles retries in code, and returns a single clean output. For instance, an agent that needs to query three APIs, filter for anomalies, and return a summary should write a Python script to do all of that and execute it once. The model sees the final output, not the raw result of each intermediate step. This reduces context bloat significantly on anything involving more than two tool calls in sequence.
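For illustration, this is the kind of script a code-as-action agent might write and execute once. The endpoints and field names are hypothetical; the point is that the model only ever sees the final summary, not three raw API responses.

```python
import json
import urllib.request

def fetch(url: str) -> list[dict]:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read())

sources = [
    "https://api.example.com/metrics/cpu",       # hypothetical endpoints
    "https://api.example.com/metrics/memory",
    "https://api.example.com/metrics/latency",
]

anomalies = []
for url in sources:
    for point in fetch(url):
        if point.get("value", 0) > point.get("threshold", float("inf")):
            anomalies.append({"source": url, "point": point})

# Single clean output returned to the agent loop.
print(json.dumps({"anomaly_count": len(anomalies), "sample": anomalies[:10]}))
```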

The harness should also implement prompt injection defence at the tool boundary. Every piece of content entering through a tool call (web pages, user-uploaded documents, external API responses) is untrusted input that could contain adversarial instructions. The harness sanitises that content before it reaches the model, using boundary markers and pattern stripping at the tool output layer.
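A minimal sketch of that sanitisation step follows. The suspicious patterns listed are illustrative and far from exhaustive; production harnesses usually combine pattern stripping with allow-lists and a separate classification pass.

```python
import re

# Illustrative patterns that often signal embedded instructions in retrieved
# content. Far from exhaustive.
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]

def sanitise_tool_output(content: str) -> str:
    for pattern in SUSPICIOUS_PATTERNS:
        content = re.sub(pattern, "[removed]", content, flags=re.IGNORECASE)
    # Boundary markers signal to the model that this is data, not instructions.
    return f"<untrusted_content>\n{content}\n</untrusted_content>"
```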

2.5 Implementing Multi-Agent Coordination

In practice, multi-agent systems add real coordination overhead and that overhead needs to be justified by something a single-agent approach genuinely cannot handle. For instance, parallel execution across subtasks, workloads that exceed a single context window, or tasks that benefit from meaningfully different model capabilities at different stages. If none of those conditions apply, multi-agent coordination is not really worth it.

When multi-agent architecture is the right call, the harness becomes responsible for something a single-agent system never has to handle: the structural relationship between agents. This includes how agents discover each other, how work is delegated and returned, how results from multiple agents are synthesised, and what happens when one agent in the chain fails or times out. None of this is handled by the models themselves. All of it lives in the harness.

In other words, the harness wires this together across three distinct layers.

Orchestration logic sits in the harness, not in any individual agent. The orchestrator agent receives a task, decomposes it into subtasks based on criteria defined in the harness, and delegates each subtask to the appropriate specialist. That routing logic (which agent handles which subtask, under what conditions, with what fallback) is a harness decision.

Agent-to-agent handoffs are governed by the A2A protocol. The harness registers each agent’s Agent Card at startup, defining its capabilities, accepted inputs, and expected outputs. When the orchestrator delegates, it queries the registry, selects the right specialist, and hands off a structured task payload via the A2A client. The specialist processes the task, returns a structured result, and the harness feeds that result back into the orchestrator’s reasoning cycle. Crucially, the orchestrator does not need to know which tools the specialist used or how it was implemented. The harness enforces that separation.

Result synthesis is a harness responsibility that is easy to overlook. When two or three specialist agents return results in parallel, something has to combine them coherently before the final response is produced. That synthesis step, whether it is a second model call, a deterministic merge function, or a structured aggregation, is defined in the harness, not improvised by the model at runtime.
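A minimal sketch of the delegation and synthesis layers, with the registry represented as plain callables rather than the A2A protocol's actual Agent Card schema; everything here is a structural illustration, not a protocol implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# The registry maps each specialist's name to a callable that performs the
# handoff, e.g. an A2A client call against that agent's endpoint.
Registry = dict[str, Callable[[str], str]]

def delegate(subtasks: dict[str, str], registry: Registry) -> dict[str, str]:
    """Route each subtask to its specialist and execute the handoffs in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(registry[name], payload)
                   for name, payload in subtasks.items()}
        return {name: future.result() for name, future in futures.items()}

def synthesise(results: dict[str, str]) -> str:
    """Deterministic merge of specialist results; a second model call works too."""
    return "\n\n".join(f"[{name}]\n{body}" for name, body in results.items())
```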

Human-in-the-loop gates are the final coordination responsibility that should be encoded in the harness. The model should not decide when to involve a human. The harness should define this explicitly: which action classes require approval before execution, what the escalation path is when an agent cannot resolve an ambiguity, and what happens if approval does not arrive within a defined timeout.
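A minimal sketch of such a gate, with the action classes, timeout, and hook functions all as illustrative assumptions.

```python
# Sketch of an approval gate. The action classes, timeout, and hook functions
# (request_approval, execute) are illustrative assumptions, defined by the
# harness rather than decided by the model.
APPROVAL_REQUIRED = {"delete_record", "send_external_email", "deploy"}
APPROVAL_TIMEOUT_SECONDS = 3600

def gated_execute(action: str, payload: dict, request_approval, execute) -> str:
    if action not in APPROVAL_REQUIRED:
        return execute(action, payload)
    approved = request_approval(action, payload, timeout=APPROVAL_TIMEOUT_SECONDS)
    if approved is None:                         # approval did not arrive in time
        return "escalated: approval timed out"
    return execute(action, payload) if approved else "rejected by reviewer"
```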

2.6 Verification

Often, models report task completion based on their own assessment of their output. That assessment is unreliable. Recently, Anthropic’s coding agent research found that agents would consistently mark features as complete without verifying they worked in the actual running application. The agent had written the code, run some unit tests, and declared success. The feature was broken in ways that only became visible when a real user tried to use it.

The solution was not to instruct the agent to be more careful. It was to add browser automation through Puppeteer, forcing the agent to verify features the way a user would: opening the application, clicking through the relevant flow, and checking the actual state change. The harness added an independent verification step that the agent could not short-circuit.

This is the evaluator-optimiser pattern: one agent or call generates an output, a separate evaluation step checks it against defined criteria, and structured feedback is fed back into the generation step if the criteria are not met. The loop continues until the output passes or a maximum iteration count is reached. A minimal code sketch of this loop follows the list below.

The evaluation step can be:

  • A separate model call with an explicit rubric and scoring instructions
  • An automated test suite that runs against the output programmatically
  • A structured evaluation agent with access to domain-specific validation tools
  • Browser or UI automation that verifies the output in an actual running environment
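As referenced above, here is a minimal sketch of the loop itself, with generate and evaluate as placeholders for whichever of the options above the harness uses.

```python
def evaluator_optimiser(generate, evaluate, max_iterations: int = 3):
    """generate(feedback) produces an output; evaluate(output) returns
    (passed, feedback). Both are placeholders for whichever evaluation step
    the harness uses: a rubric-driven model call, a test suite, a validation
    agent, or browser automation."""
    output, feedback = None, None
    for _ in range(max_iterations):
        output = generate(feedback)             # regenerate with structured feedback
        passed, feedback = evaluate(output)     # independent check, not self-report
        if passed:
            return output
    return output  # best effort after hitting the iteration cap
```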

3.0 Conclusion

The model is the reasoning engine. It gets the benchmarks; the harness earns the reliability.

Memory without structure loses state. Context without selection generates noise. Tool calls without verification produce confident wrong answers. Loops without stopping conditions exhaust budgets. Multi-agent systems without protocols create brittle custom bridges.

Agent harnessing is that infrastructure. It is the memory system that ensures an agent picks up where it left off instead of starting over. It is the context layer that ensures the model is reasoning over the right information rather than noise. It is the verification loop that catches errors before they propagate. It is the stopping conditions that prevent runaway loops. It is the routing logic that balances capability against cost. It is the protocol layer that makes agents composable across vendor boundaries. It is the observability stack that makes failures diagnosable.

Everyone building agents has access to the same frontier models through the same APIs. The differentiator is harness maturity: how systematically and deliberately the non-model infrastructure has been designed, instrumented, and improved.

Build the harness. Everything else follows from it.

4.0 References

Ehtesham, A., Hassan, A., Abu-Salih, S. and Abuhmed, T. (2026) Agentic Artificial Intelligence: Architectures, Taxonomies, and Evaluation of Large Language Model Agents, arXiv:2601.12560. Available at: https://arxiv.org/abs/2601.12560

Research and engineering references: Anthropic engineering blog, OpenAI system cards and field reports, Linux Foundation Agentic AI Foundation documentation, arXiv 2601.12560 (January 2026), Mem0 ECAI 2025 paper, LangChain context engineering documentation (July 2025).


