Harness in AI Agents

digitado ⋅ May 31, 2026

A harness in AI agents is the runtime and control plane that turns a raw model into something that can actually do work. In Anthropic’s wording, an agent harness or scaffold is the system that enables a model to act as an agent by processing inputs, orchestrating tool calls, and returning results, while OpenAI describes the harness as the control plane around the model that owns the loop, tool routing, approvals, tracing, recovery, and run state. LangChain summarizes the same idea as “Agent = Model + Harness,” meaning the model provides intelligence and the harness supplies state, tools, feedback loops, and enforceable constraints.

Modern AI agents are not just Large Language Models (LLMs). A raw model alone cannot reliably execute workflows, call APIs, maintain state, recover from failures, coordinate tools, enforce permissions, or operate safely in production systems.

The missing layer between the LLM and the real-world execution environment is called the Harness.

The harness is one of the most important architectural concepts in modern agentic systems. Companies like Anthropic and OpenAI increasingly emphasize that the actual intelligence of an AI agent comes not only from the model itself, but from the surrounding orchestration runtime.

A harness transforms:

LLM → Autonomous Operational Agent

Without a harness, a model is simply generating text.

With a harness, the model becomes capable of:

Planning
Tool usage
Multi-step reasoning
Memory management
Task execution
Environment interaction
Error recovery
Human approvals
Security enforcement
Long-running operations
Agent collaboration
Evaluation and monitoring

The core idea

The cleanest mental model is that the agent loop is not the model itself. The model reasons, selects actions, and proposes outputs, but the harness decides when the model is called, which tools are available, how tool results are returned, how state is preserved, when to stop, and how to recover when something fails. This pattern traces back to ReAct, which showed that interleaving reasoning traces with actions improves task solving, interpretability, and error handling because the model can update plans after observing the environment.

What a harness contains in practice

A harness is the runtime orchestration layer around an AI model that manages execution, tools, memory, state, safety, recovery, and coordination. A production harness usually includes the instruction assembly logic, the tool registry, the execution loop, state management, memory handling, permission checks, approvals, logging, tracing, retries, and recovery. OpenAI explicitly frames the harness as the full contract around the model, including instructions, tools, routing, output requirements, and validation checks. Anthropic similarly emphasizes that tool calls can be executed either by your application or by Anthropic, depending on whether the tool is client-side or server-side, which is why the harness must understand where execution happens and how results flow back into the loop.

Control Plane + Runtime + Execution Orchestrator

Anthropic’s view of the harness

Anthropic’s current Managed Agents offering is a useful reference point because it packages the harness for you. Anthropic describes it as a pre-built, configurable agent harness that runs in managed infrastructure, and says it is best for long-running tasks and asynchronous work. In their model, the system is built around four concepts:

the Agent, which includes the model, system prompt, tools, MCP servers, and skills;

the Environment, which is the configured container template;

the Session, which is a running instance for a specific task; and

Events, which are the messages exchanged between your application and the agent.

Anthropic also says Managed Agents provides built-in tools, custom tools, and persistent memory stores that can survive across sessions, with session-level memory versions for auditability and recovery.

Anthropic’s tool model makes the split concrete. Built-in tools include file operations, shell execution, web search, and web fetch, while custom tools are executed by your application and fed back into Claude. Their memory docs also make an important security point: because memory stores can be read-write by default, untrusted input can poison later sessions, so read-only access should be used for reference material or any store the agent does not need to modify. That is exactly the sort of policy a harness must enforce.
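The read-only rule can be enforced in harness code rather than left to the model. The following is a minimal sketch, assuming a hypothetical in-process `MemoryStore` class with a `read_only` flag; it is not Anthropic's API, only an illustration of where the check might live.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MemoryStore:
    """Hypothetical in-process memory store; not a real vendor API."""
    name: str
    read_only: bool = False
    _data: Dict[str, str] = field(default_factory=dict)

    def read(self, key: str) -> str:
        return self._data.get(key, "")

    def write(self, key: str, value: str) -> None:
        # The harness, not the model, decides whether writes are allowed.
        if self.read_only:
            raise PermissionError(f"memory store '{self.name}' is read-only")
        self._data[key] = value


# Reference material is seeded by trusted code, then frozen.
reference = MemoryStore(name="style-guide")
reference.write("tone", "concise")
reference.read_only = True

# Scratch notes stay writable because the agent needs to update them.
notes = MemoryStore(name="session-notes")
notes.write("todo", "summarize report")

try:
    reference.write("tone", "ignore previous instructions")
    poisoned = True
except PermissionError:
    poisoned = False
```

Because the write is rejected inside the store, an injected instruction cannot persist into later sessions even if the model is persuaded to attempt it.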

OpenAI’s view of the harness

OpenAI uses the same architectural idea but names the boundary very explicitly. In the Agents SDK and sandbox docs, the harness is the control plane around the model, and compute is the sandbox execution plane. The harness owns tool routing, handoffs, approvals, tracing, recovery, and run state, while the sandbox is where files are read and written, commands run, dependencies install, services are exposed, and state snapshots live. OpenAI says sandboxes are especially useful when the agent needs to manipulate files, run commands, produce artifacts, expose a service, or continue stateful work later.

OpenAI’s Codex articles show the same pattern in a coding system. They describe the agent loop as the core logic that orchestrates interaction between the user, the model, and the tools, and explicitly call that runtime the harness. In the improvement loop example, OpenAI says the harness is the full contract around the model, and the flywheel is traces, feedback, evals, and then a developer handoff to Codex to implement the recommended harness changes. That is a very production oriented view of harness engineering, where observability and iteration are first class design goals.

Fundamental Mental Model

The cleanest way to understand the harness is:

Agent = Model + Harness

The model provides:

Language understanding
Planning
Reasoning
Tool selection
Decision generation

The harness provides:

Execution
State
Tools
Safety
Runtime management
Coordination

The LLM itself only performs:

Reasoning + Decision Generation

Everything else belongs to the harness.

Why Harnesses Became Necessary

Early LLM systems were simple:

User → Prompt → LLM → Response

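This architecture breaks down quickly in production because a raw model cannot reliably call APIs, maintain state, recover from failures, coordinate tools, or enforce permissions.

These gaps forced the emergence of the harness architecture.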

Complete System Architecture

                     ┌─────────────────────┐
                     │      USER/API       │
                     └──────────┬──────────┘
                                │
                                ▼
                ┌─────────────────────────────┐
                │         HARNESS             │
                │─────────────────────────────│
                │ Prompt Assembly             │
                │ Agent Loop                  │
                │ Tool Registry               │
                │ Memory Manager              │
                │ Policy Engine               │
                │ Session State               │
                │ Retry Logic                 │
                │ Tracing                     │
                │ Human Approval              │
                │ Multi-Agent Router          │
                │ Evals                       │
                └──────────┬──────────────────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
          ▼                ▼                ▼
  ┌────────────┐   ┌────────────┐   ┌────────────┐
  │   TOOLS    │   │ SANDBOXES  │   │  MEMORY    │
  └────────────┘   └────────────┘   └────────────┘
                           │
                           ▼
                   ┌──────────────┐
                   │    MODEL     │
                   └──────────────┘

Evaluation harness versus runtime harness

People often use harness to mean the runtime, but there is also an evaluation harness. Anthropic defines an evaluation harness as the infrastructure that runs evals end to end, provides instructions and tools, runs tasks concurrently, records every step, grades outputs, and aggregates results. They also state that when you evaluate an agent, you are evaluating the harness and the model together. That distinction matters because a good model inside a weak harness can still fail in production, while a strong harness can make a smaller model dramatically more useful and reliable.

The practical anatomy of a good harness

A strong harness usually has a bounded agent loop, a strict tool contract, explicit context management, a durable session store, a sandbox or other execution boundary, lifecycle hooks, and a safety layer that can block or modify unsafe actions. Arize’s recent field report captures the same industry convergence and describes a harness as a fixed runtime architecture with an iteration loop, context management, tool registry, permission layer, and end to end observability, while Martin Fowler argues that choosing a constrained topology makes a comprehensive harness more achievable because it reduces the variety the system must regulate.

The main engineering tradeoff is flexibility versus control. A permissive harness can do more kinds of work, but it becomes harder to secure, test, and debug. A constrained harness can be much safer and more legible, especially in coding or document workflows, because the environment, tool set, and state transitions are all narrow enough to reason about. OpenAI’s engineering post on Codex says early progress was slower when the environment was underspecified, and that the team had to make the application legible to the agent with structures like repository knowledge, UI logs, and metrics. That is a strong signal that harness quality is often the real bottleneck, not raw model quality.

Core Components of a Harness

Prompt Construction Layer

The harness dynamically assembles prompts from:

User input
Session history
Memory retrieval
Tool outputs
Policies
Agent state
System instructions

Example:

context = {
    "system_prompt": system_rules,
    "memory": retrieved_memory,
    "tools": available_tools,
    "history": conversation_history,
    "current_task": task
}

The harness decides what the model sees.
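Prompt assembly can be sketched as a small function. This assumes a chat-style message list with `role` and `content` keys; the function name `build_messages` and the `max_history` parameter are illustrative, not taken from any particular SDK.

```python
from typing import Dict, List


def build_messages(
    system_rules: str,
    retrieved_memory: List[str],
    history: List[Dict[str, str]],
    task: str,
    max_history: int = 4,
) -> List[Dict[str, str]]:
    """Assemble the message list the model will see for one turn."""
    system_parts = [system_rules]
    if retrieved_memory:
        # Memory is injected as clearly labeled context, not as instructions.
        memory_block = "\n".join(f"- {item}" for item in retrieved_memory)
        system_parts.append("Relevant memory:\n" + memory_block)

    messages = [{"role": "system", "content": "\n\n".join(system_parts)}]
    # Keep only the most recent turns so the prompt stays bounded.
    recent = history[-max_history:] if max_history > 0 else []
    messages.extend(recent)
    messages.append({"role": "user", "content": task})
    return messages


messages = build_messages(
    system_rules="You are a careful assistant.",
    retrieved_memory=["User prefers metric units."],
    history=[{"role": "user", "content": f"turn {i}"} for i in range(6)],
    task="Convert 5 miles to kilometers.",
)
```

Here the six history turns are cut to the last four, so the model receives one system message, four recent turns, and the current task.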

Agent Loop

The loop is the heart of the harness.

Basic loop:

Reason → Decide → Execute → Observe → Repeat

This originated from the ReAct paradigm.

Example Loop

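# llm and execute_tool are placeholders supplied by the harness.
context = [task]  # running list of everything the model sees
done = False

while not done:

    response = llm(context)

    if response.tool_call:
        # Run the requested tool and feed the observation back to the model.
        result = execute_tool(response.tool_call)
        context.append(result)

    elif response.final_answer:
        done = True

    else:
        # Neither a tool call nor an answer: stop instead of spinning forever.
        raise ValueError("model returned no tool call and no final answer")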
This transforms static inference into autonomous behavior.
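To make the loop concrete, here is a runnable sketch with a scripted stand-in for the model. The names `run_loop` and `scripted_llm`, and the dictionary shape of the model response, are assumptions made for illustration; a real harness would call a model API at that point.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


def run_loop(
    llm: Callable[[List[Message]], Message],
    tools: Dict[str, Callable[..., Any]],
    task: str,
    max_steps: int = 5,
) -> str:
    """Reason -> Decide -> Execute -> Observe, bounded by max_steps."""
    context: List[Message] = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = llm(context)
        if response["type"] == "final":
            return response["content"]
        # Execute the requested tool and append the observation.
        result = tools[response["tool"]](**response["args"])
        context.append({"role": "tool", "content": str(result)})
    return "stopped: step limit reached"


def scripted_llm(context: List[Message]) -> Message:
    """Stand-in for a model: call a tool once, then answer from its output."""
    if context[-1]["role"] == "tool":
        return {"type": "final", "content": f"The sum is {context[-1]['content']}."}
    return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}


answer = run_loop(scripted_llm, {"add": lambda a, b: a + b}, "What is 2 + 3?")
```

The step limit matters as much as the loop itself: a model that never produces a final answer ends with a stop message instead of running forever.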

Tool Orchestration

Tools are external capabilities such as search, calculation, and sandboxed code execution.

The harness:

Registers tools
Validates permissions
Executes calls
Returns outputs
Handles retries
Normalizes responses

Tool Registry Architecture

TOOLS = {
    "search": SearchTool(),
    "calculator": CalculatorTool(),
    "python": PythonSandboxTool()
}

The harness controls access:

# allowed_tools is the set of tool names this session may use.
if tool_name not in allowed_tools:
    deny_execution()

This is critical for security.
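The registry duties listed in this section (registering, checking permissions, executing, retrying, and normalizing output) might be combined as follows. `ToolRegistry` and its result shape are hypothetical names chosen for this sketch, and the retry policy is deliberately naive.

```python
from typing import Any, Callable, Dict, Set


class ToolRegistry:
    """Hypothetical registry: allowlist check, bounded retries, uniform results."""

    def __init__(self, allowed: Set[str], max_attempts: int = 2) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._allowed = allowed
        self._max_attempts = max_attempts

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        # Deny anything that is unregistered or not on the allowlist.
        if name not in self._tools or name not in self._allowed:
            return {"tool": name, "ok": False, "output": "denied"}
        error = ""
        for _ in range(self._max_attempts):
            try:
                return {"tool": name, "ok": True, "output": self._tools[name](**kwargs)}
            except Exception as exc:  # a real harness would retry only transient errors
                error = str(exc)
        return {"tool": name, "ok": False, "output": error}


registry = ToolRegistry(allowed={"calculator"})
registry.register("calculator", lambda a, b: a * b)
registry.register("shell", lambda cmd: f"ran {cmd}")

allowed_result = registry.call("calculator", a=6, b=7)
denied_result = registry.call("shell", cmd="rm -rf /")
```

Every call returns the same three-key dictionary, so the agent loop never has to distinguish a denied call from a failed one by exception type.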

Memory Systems

Memory is one of the most misunderstood parts of agent systems.

The harness manages multiple memory layers.

Short-Term Memory

Conversation/session state.

Current task
Recent tool outputs
Recent thoughts
Temporary state

Usually stored in:

Redis
In-memory cache
Session stores

Long-Term Memory

Persistent storage.

Examples:

Vector databases
Knowledge graphs
Document stores
Relational DBs

Used for:

User preferences
Historical tasks
Organizational knowledge
Agent learning

Episodic Memory

Stores past execution traces.

Useful for:

Reflection
Self-improvement
Recovery
Planning
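The three memory layers can be sketched in one small class. This is an illustration only: `AgentMemory` is a hypothetical name, everything lives in process memory, and a keyword match stands in for the vector search a real long-term store would use.

```python
from collections import deque
from typing import Deque, Dict, List


class AgentMemory:
    """Hypothetical three-layer memory; keyword match stands in for vector search."""

    def __init__(self, short_term_size: int = 3) -> None:
        self.short_term: Deque[str] = deque(maxlen=short_term_size)  # recent turns
        self.long_term: List[str] = []                               # durable facts
        self.episodes: List[Dict[str, str]] = []                     # past run traces

    def observe(self, text: str) -> None:
        # Old items fall off automatically once the deque is full.
        self.short_term.append(text)

    def remember(self, fact: str) -> None:
        self.long_term.append(fact)

    def record_episode(self, task: str, outcome: str) -> None:
        self.episodes.append({"task": task, "outcome": outcome})

    def retrieve(self, query: str) -> List[str]:
        # Return long-term facts sharing at least one word with the query.
        words = set(query.lower().split())
        return [f for f in self.long_term if words & set(f.lower().split())]


memory = AgentMemory(short_term_size=2)
for turn in ["hello", "run tests", "tests failed"]:
    memory.observe(turn)
memory.remember("user prefers pytest")
memory.remember("deploys happen on fridays")
memory.record_episode(task="run tests", outcome="failed: missing dependency")

hits = memory.retrieve("which pytest flags")
```

Short-term memory is bounded by construction, long-term memory is only surfaced through retrieval, and episodes are kept separately so they can feed reflection or recovery later.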

Sandbox Execution

One of the most important modern harness concepts.

The model must never directly execute arbitrary code.

Instead:

Harness → Sandbox → Execution

Sandbox responsibilities:

Isolated execution
Resource limits
Network controls
File system controls
Dependency management
State snapshots
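As a rough illustration of the harness-to-sandbox handoff, the sketch below runs model-proposed Python in a child process with a scratch directory and a timeout. The function name `run_in_sandbox` is hypothetical, and a child process is only a process boundary: it does not provide the network, file system, or resource isolation that a container or microVM would.

```python
import subprocess
import sys
import tempfile
from typing import Dict, Union


def run_in_sandbox(code: str, timeout_s: float = 5.0) -> Dict[str, Union[str, int, bool]]:
    """Run Python code in a child process with a scratch directory and a timeout.

    This is only a process boundary, not real isolation: a production harness
    would use a container or microVM with network and filesystem controls.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timeout", "returncode": -1}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }


good = run_in_sandbox("print(2 + 2)")
bad = run_in_sandbox("raise SystemExit(3)")
slow = run_in_sandbox("import time; time.sleep(10)", timeout_s=0.5)
```

The harness gets back a structured result in all three cases (success, failure, timeout), which is what lets it decide whether to retry, report the error to the model, or stop.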

Sandbox Architecture

LLM
 │
 ▼
Harness
 │
 ▼
Container / VM / Firecracker
 │
 ├── Python
 ├── File System
 ├── Temporary State
 ├── API Access
 └── Output Artifacts

Technologies: containers, virtual machines, and Firecracker microVMs.

Policy Enforcement Layer

Critical for enterprise deployment.

The harness validates:

Tool permissions
Dangerous actions
Data access
Network calls
Prompt injection
Secret leakage
Destructive operations

Example:

if action == "delete_database":
    require_human_approval()
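A slightly fuller version of that check returns one of three verdicts instead of a yes or no. The rule tables, the `http_get` action name, and the `sk-` prefix test are all assumptions for illustration; real policies would be loaded from configuration and would be far more thorough.

```python
from enum import Enum
from typing import Any, Dict


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NEEDS_APPROVAL = "needs_approval"


# Hypothetical rule tables; a real deployment would load these from config.
DESTRUCTIVE_ACTIONS = {"delete_database", "drop_table"}
BLOCKED_HOSTS = {"internal.example.com"}


def check_policy(action: str, arguments: Dict[str, Any]) -> Verdict:
    """Classify a proposed action before the harness executes it."""
    if action in DESTRUCTIVE_ACTIONS:
        return Verdict.NEEDS_APPROVAL
    if action == "http_get" and arguments.get("host") in BLOCKED_HOSTS:
        return Verdict.DENY
    # Crude secret check: refuse any argument value that looks like an API key.
    if any(str(value).startswith("sk-") for value in arguments.values()):
        return Verdict.DENY
    return Verdict.ALLOW
```

Separating "deny" from "needs approval" lets the harness block clearly unsafe calls outright while routing destructive but legitimate ones to a human.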

Human-in-the-Loop Systems

Modern harnesses often include approval workflows.
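One way such a workflow might look is a queue of pending actions that the harness refuses to execute until a human decides. `ApprovalQueue`, `PendingAction`, and `execute_if_approved` are hypothetical names, and the queue is held in memory only for the sake of the sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class PendingAction:
    action_id: int
    tool: str
    arguments: Dict[str, Any]
    status: str = "pending"  # pending -> approved | rejected


@dataclass
class ApprovalQueue:
    """Hypothetical in-memory queue; production systems would persist it."""
    items: List[PendingAction] = field(default_factory=list)

    def submit(self, tool: str, arguments: Dict[str, Any]) -> PendingAction:
        item = PendingAction(action_id=len(self.items) + 1, tool=tool, arguments=arguments)
        self.items.append(item)
        return item

    def decide(self, action_id: int, approve: bool) -> PendingAction:
        item = next(i for i in self.items if i.action_id == action_id)
        item.status = "approved" if approve else "rejected"
        return item


def execute_if_approved(item: PendingAction, run: Callable[..., Any]) -> Any:
    # The harness refuses to run anything a human has not approved.
    if item.status != "approved":
        raise PermissionError(f"action {item.action_id} is {item.status}")
    return run(**item.arguments)


queue = ApprovalQueue()
pending = queue.submit("send_email", {"to": "team@example.com"})
queue.decide(pending.action_id, approve=True)
sent = execute_if_approved(pending, lambda to: f"sent to {to}")
```

Because the decision is stored on the action itself, the same record doubles as an audit trail of who approved what.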

Multi-Agent Harnesses

The harness also orchestrates multiple agents.

Architecture

                Supervisor Agent
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
  Research Agent  Coding Agent  Review Agent

The harness manages:

Routing
Shared memory
Message passing
Conflict resolution
Synchronization
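The routing and shared-memory duties can be illustrated with plain functions standing in for agents. Everything here is an assumption made for the sketch: a fixed plan replaces the supervisor model's decisions, and a dictionary is the only channel between workers, which sidesteps synchronization by running them one at a time.

```python
from typing import Callable, Dict, List

Worker = Callable[[str, Dict[str, str]], str]


def research_agent(task: str, shared: Dict[str, str]) -> str:
    shared["notes"] = f"findings for: {task}"
    return "research done"


def coding_agent(task: str, shared: Dict[str, str]) -> str:
    # Reads what the research agent left in shared memory.
    shared["patch"] = f"patch based on [{shared.get('notes', 'no notes')}]"
    return "code written"


def review_agent(task: str, shared: Dict[str, str]) -> str:
    return "approved" if "patch" in shared else "nothing to review"


def supervise(task: str, plan: List[str], workers: Dict[str, Worker]) -> List[str]:
    """Run workers in plan order; the shared dict is the only message channel."""
    shared: Dict[str, str] = {}
    log = []
    for name in plan:
        log.append(f"{name}: {workers[name](task, shared)}")
    return log


workers = {"research": research_agent, "coding": coding_agent, "review": review_agent}
log = supervise("fix login bug", ["research", "coding", "review"], workers)
```

Changing the plan order changes the outcome (a review that runs first finds nothing to review), which is a small preview of why coordination, not agent count, is the hard part.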

Why Multi-Agent Systems Become Dangerous

One of the biggest industry realizations today is:

More agents ≠ Better systems

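The problems are the ones the harness already has to manage, multiplied: routing, shared memory, message passing, conflict resolution, and synchronization all grow with every agent added.

Anthropic engineers have discussed similar orchestration complexity concerns in agentic workflows.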

Observability and Tracing

Production harnesses require full observability.

Without tracing, debugging agents becomes impossible.

Tracing Includes

Prompts
Tool calls
Latency
Token usage
Failures
Agent transitions
Memory retrievals
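A tracer that captures these fields does not need much code. The sketch below is an in-memory illustration with hypothetical names (`Tracer`, `span`); a production harness would export the same records to a tracing backend instead of a list.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class Tracer:
    """Hypothetical in-memory tracer; real systems export spans to a backend."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, Any]] = []

    @contextmanager
    def span(self, kind: str, **attrs: Any) -> Iterator[Dict[str, Any]]:
        record: Dict[str, Any] = {"kind": kind, "error": None, **attrs}
        start = time.perf_counter()
        try:
            yield record  # callers may attach fields such as token counts
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Latency is recorded whether the step succeeded or failed.
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)


tracer = Tracer()

with tracer.span("model_call", prompt="What is 2 + 2?") as s:
    s["tokens"] = 12  # placeholder; a real harness reads this from the API response

try:
    with tracer.span("tool_call", tool="calculator"):
        raise ValueError("division by zero")
except ValueError:
    pass
```

The failed tool call still produces a span with its error and latency, which is exactly the record needed when debugging a run after the fact.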

Popular tools:

Evals Harness

Separate from runtime harness.

Purpose:

Measure reliability and quality

An eval harness:

Runs benchmark tasks
Executes agents repeatedly
Records traces
Scores outputs
Detects regressions
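Those five duties fit in a short sketch. The names `run_evals` and `toy_agent`, the exact-match scoring, and the `BASELINE` value are assumptions for illustration; real eval harnesses typically use richer graders and run tasks concurrently.

```python
from typing import Callable, Dict, List, Tuple

EvalCase = Tuple[str, str]  # (input, expected output)


def run_evals(agent: Callable[[str], str], cases: List[EvalCase], repeats: int = 3) -> Dict[str, object]:
    """Run every case several times, record each attempt, and aggregate a pass rate."""
    records = []
    for prompt, expected in cases:
        for attempt in range(repeats):
            output = agent(prompt)
            records.append(
                {"prompt": prompt, "attempt": attempt, "output": output, "passed": output == expected}
            )
    passed = sum(1 for r in records if r["passed"])
    pass_rate = passed / len(records) if records else 0.0
    return {"pass_rate": pass_rate, "records": records}


def toy_agent(prompt: str) -> str:
    """Stand-in agent that only knows how to uppercase short inputs."""
    return prompt.upper() if len(prompt) < 6 else prompt


report = run_evals(toy_agent, [("abc", "ABC"), ("longer input", "LONGER INPUT")])

# A regression gate: flag the run if the pass rate drops below a stored baseline.
BASELINE = 0.5
regressed = report["pass_rate"] < BASELINE
```

Repeating each case matters for real agents because their outputs vary between runs; the toy agent here is deterministic, so the repeats only show the bookkeeping.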

Runtime Harness vs Eval Harness

Harness Failure Modes

Infinite Loops

The agent calls tools repeatedly without making progress.

Mitigation:

MAX_ITERATIONS = 8
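An iteration cap catches every runaway loop eventually, but a stuck agent can often be detected sooner by noticing identical calls. The sketch below combines both checks; `should_stop` and the `MAX_REPEATS` threshold are assumptions chosen for illustration.

```python
import json
from typing import Any, Dict, List

MAX_ITERATIONS = 8
MAX_REPEATS = 2  # assumption: an identical call seen more than twice means a stuck agent


def should_stop(calls: List[Dict[str, Any]]) -> str:
    """Return a stop reason, or an empty string if the loop may continue."""
    if len(calls) >= MAX_ITERATIONS:
        return "max_iterations"
    # Serialize each call so identical tool + arguments compare equal.
    keys = [json.dumps(call, sort_keys=True) for call in calls]
    if keys and keys.count(keys[-1]) > MAX_REPEATS:
        return "repeated_call"
    return ""
```

The harness would call `should_stop` with the history of tool calls before each new model turn and end the run as soon as a reason comes back.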

Prompt Injection

External content manipulates the agent.

Mitigation:

Input sanitization
Context isolation
Read-only memory

Tool Abuse

Agent performs unsafe actions.

Mitigation:

Permission policies
Sandboxing
Human approval

Context Explosion

Too much accumulated context degrades the model's output.

Mitigation:

Summarization
Retrieval filtering
Context windows
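The simplest of these mitigations is trimming to a token budget. The sketch below keeps the system message and as many of the newest messages as fit; the four-characters-per-token estimate is a rough assumption, and the function names are illustrative. A real harness would use the model's tokenizer and might summarize dropped turns instead of discarding them.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    # Rough assumption: about one token per four characters.
    return max(1, len(text) // 4)


def trim_context(messages: List[Dict[str, str]], budget: int) -> List[Dict[str, str]]:
    """Keep the system message plus the newest messages that fit the budget.

    Assumes messages[0] is the system message.
    """
    system, rest = messages[0], messages[1:]
    remaining = budget - estimate_tokens(system["content"])
    kept: List[Dict[str, str]] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [system] + list(reversed(kept))


history = [{"role": "system", "content": "Be brief."}]
history += [{"role": "user", "content": f"message {i} " + "x" * 36} for i in range(5)]

trimmed = trim_context(history, budget=30)
```

With a budget of 30 estimated tokens, only the system message and the two newest user messages survive, and their original order is preserved.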

Enterprise Harness Architecture

Production Reference Design

                    API Gateway
                          │
                          ▼
                  Harness Orchestrator
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
   Session Store     Policy Engine     Tracing Layer
        │                 │                 │
        ▼                 ▼                 ▼
   Vector DB         Approval Queue      Metrics DB
        │
        ▼
    Tool Layer
        │
        ▼
    Sandboxes
        │
        ▼
      Models

A reference architecture you can implement

A production grade harness usually looks like this: the user sends a task, the harness composes the system prompt and context, the model proposes either a final answer or a tool call, the harness validates the call against policy, executes the tool in a safe environment, captures the result, writes the result back into state, and loops until termination. Termination happens on a final text answer, a max iteration limit, a recoverable stop condition, or an unrecoverable error. This is the common pattern across Anthropic’s tool use model, Anthropic’s managed agents, and OpenAI’s sandbox and agent loop design.

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Protocol, Union


class Tool(Protocol):
    name: str

    def run(self, arguments: Dict[str, Any]) -> Any:
        ...


@dataclass
class AgentState:
    session_id: str
    messages: List[Dict[str, Any]] = field(default_factory=list)
    scratchpad: Dict[str, Any] = field(default_factory=dict)
    iterations: int = 0
    max_iterations: int = 8


@dataclass
class ToolResult:
    tool_name: str
    result: Any
    is_error: bool = False


class Harness:
    """
    Minimal production-style harness skeleton.

    Responsibilities:
    1. Build context
    2. Ask the model for the next step
    3. Validate and execute tools
    4. Persist state
    5. Stop safely
    """

    def __init__(
        self,
        model_call: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
        tools: Dict[str, Tool],
        policy_check: Callable[[str, Dict[str, Any]], bool],
        state_store: Callable[[AgentState], None],
        trace: Optional[Callable[[str, Dict[str, Any]], None]] = None,
    ) -> None:
        self.model_call = model_call
        self.tools = tools
        self.policy_check = policy_check
        self.state_store = state_store
        self.trace = trace or (lambda event, payload: None)

    def run(self, state: AgentState) -> Dict[str, Any]:
        while state.iterations < state.max_iterations:
            state.iterations += 1
            self.trace("iteration_start", {"iteration": state.iterations})

            model_output = self.model_call(state.messages)
            self.trace("model_output", model_output)

            if model_output.get("type") == "final":
                state.messages.append({"role": "assistant", "content": model_output["content"]})
                self.state_store(state)
                return {"status": "done", "content": model_output["content"]}

            if model_output.get("type") != "tool_call":
                state.messages.append({"role": "assistant", "content": "Invalid model output."})
                self.state_store(state)
                return {"status": "error", "reason": "invalid_model_output"}

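            tool_name = model_output["tool_name"]
            arguments = model_output.get("arguments", {})
            # Record the model's request so later turns can see which call produced the result.
            state.messages.append(
                {"role": "assistant", "tool_name": tool_name, "arguments": arguments}
            )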

            if tool_name not in self.tools:
                state.messages.append({"role": "assistant", "content": f"Unknown tool: {tool_name}"})
                self.state_store(state)
                return {"status": "error", "reason": "unknown_tool"}

            if not self.policy_check(tool_name, arguments):
                state.messages.append({"role": "assistant", "content": f"Denied tool: {tool_name}"})
                self.state_store(state)
                return {"status": "error", "reason": "policy_denied"}

            try:
                result = self.tools[tool_name].run(arguments)
                tool_result = ToolResult(tool_name=tool_name, result=result)
            except Exception as exc:
                tool_result = ToolResult(tool_name=tool_name, result=str(exc), is_error=True)

            state.messages.append(
                {
                    "role": "tool",
                    "tool_name": tool_name,
                    "content": tool_result.result,
                    "is_error": tool_result.is_error,
                }
            )
            self.state_store(state)
            self.trace("tool_result", {"tool_name": tool_name, "is_error": tool_result.is_error})

        return {"status": "stopped", "reason": "max_iterations"}

How to design it correctly

The biggest architectural decision is the boundary between the harness and compute. OpenAI’s sandbox guidance is very clear that the harness should keep auth, billing, audit logs, human review, and recovery outside the container, while the sandbox should handle the mutable work. Anthropic’s memory guidance points to the same principle from a different angle, since persistent state must be separated by trust level or it can become an injection path. In practice, that means your harness should treat web content, user input, and third party tool output as untrusted, keep secrets out of model visible storage, and use read only memory or scoped mounts whenever possible.

The other major design choice is how much autonomy you allow. Anthropic’s Messages API is positioned for custom agent loops and fine-grained control, while Managed Agents is positioned for long-running and asynchronous tasks. OpenAI’s sandbox approach also assumes that some work belongs inside a managed execution environment rather than directly in prompt space. So the right harness is usually not the most autonomous one. It is the one that is easiest to control, inspect, and recover when things go wrong.

The evaluation and improvement loop

A real harness is never finished. Anthropic’s eval guidance says the harness and model are evaluated together, which means you must test not just model quality but the entire orchestration stack. OpenAI’s improvement loop recommends using traces to see what happened, feedback to explain what mattered, and evals to make those expectations reusable. That is the right operational model for production: trace every run, turn failures into evals, and use the evals to evolve the harness instead of repeatedly patching prompts by hand.

The harness is responsible for:

Harness Design Principles

Deterministic Infrastructure

The model may be probabilistic.

The infrastructure must not be.

Constrained Autonomy

Never allow unrestricted execution.

Explicit State

All state transitions must be observable.

Separation of Concerns

Relationship with MCP

Anthropic’s Model Context Protocol (MCP) integrates naturally with harnesses.

The harness acts as the MCP orchestrator.

MCP standardizes:

Tool schemas
Resource discovery
Structured capabilities
Agent interoperability

Modern Agent Stack

Full Stack

Application Layer
        │
Harness / Orchestrator
        │
Memory + Tools + Policies
        │
Sandbox Infrastructure
        │
LLM Providers

The Most Important Insight

The industry is realizing something fundamental.

The harness is becoming more important than the model itself.

Why?

Because production reliability depends more on:

Orchestration
Tool reliability
Context engineering
Safety systems
Observability
Recovery
Memory quality

than raw benchmark intelligence.

Future of Harnesses

Future harnesses will likely include:

A harness is the code and infrastructure around an AI model that makes the model usable as an agent. It includes the loop, the tool system, state and memory, execution boundaries, permissions, tracing, recovery, and evaluation. The model thinks. The harness makes thinking executable, safe, and repeatable.

Harness in AI Agents was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.