From Prompts to Harnesses: How AI Engineering Has Grown Up
I want to talk about something that has been quietly changing how serious developers build with AI tools.
It did not happen overnight. There was no single blog post or conference talk that kicked it off. It happened gradually, the way most good ideas in software do. Developers ran into the same problems over and over again, figured out better ways to solve them, and those better ways eventually got names.
The names are: prompt engineering, context engineering, and harness engineering.
If you have been building anything with AI over the past couple of years, you have probably been doing all three without realizing it. This article is about making that progression visible and intentional, so you can be more deliberate about where you spend your energy.
The newest of the three is harness engineering. In the span of about six weeks in early 2026, it went from a niche practitioner habit to the most-talked-about concept in software development circles. Mitchell Hashimoto wrote about it on February 5, 2026. OpenAI followed on February 11. Martin Fowler published an analysis shortly after. The term had arrived.
But to understand why harness engineering matters, you have to see how we got here. The story begins two years earlier, with prompts.
I have been covering pieces of this journey across several articles. I wrote about Model Context Protocol when Anthropic first released it. I wrote about contextual chatbot conversations back in the Watson days. I wrote about process monitoring scripts in Linux as a way of thinking about systems that watch themselves. All of that comes together here.
Let us start from the beginning.
Prompt Engineering
What It Is
When AI language models became easy to use, the first thing developers noticed was that the way you wrote your question had a huge effect on the quality of the answer.
Ask “fix this code” and you get one kind of answer.
Ask “fix this code, explain what was wrong, and make sure the fix works with Python 3.10” and you get a much more useful answer.
That observation, repeated thousands of times across thousands of developers, turned into a discipline. People started calling it prompt engineering, and they started sharing what worked.
Why the Wording Matters So Much
A language model does not read your prompt the way a human reads a sentence. It is trying to predict what the most useful continuation of your text would be, based on everything it learned during training. When you add more details, more constraints, and clearer goals, you are narrowing down the space of possible answers it might give you. That usually means you get something closer to what you actually wanted.
Here is a simple way to think about it: the whole job of prompt engineering is a loop. You write a prompt, read the answer, and adjust the prompt. Simple in concept, but there is a lot of skill in knowing how to adjust.
The Main Techniques
Zero-shot prompting means you just ask your question with no examples. “Translate this sentence to French.” Clean and direct. Works well for things the model has seen a lot of during training.
Few-shot prompting means you give the model two or three examples before your real question. If you want the model to format output in a specific way, showing it what the output should look like is far more reliable than describing it in words. Think of it like showing a new colleague what “done” looks like rather than writing a two-page specification.
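For instance, if you want reviews turned into structured fields, a few-shot prompt might look like this (a minimal sketch; the reviews are invented for illustration):

```python
# Two worked examples, then the real input. The model infers the
# output format from the examples instead of a written description.
prompt = """Extract the product and the sentiment from each review.

Review: The battery on this laptop lasts all day.
Output: product=laptop, sentiment=positive

Review: These headphones broke after a week.
Output: product=headphones, sentiment=negative

Review: The keyboard feels mushy and the keys stick.
Output:"""
```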

Chain-of-thought prompting means you ask the model to work through a problem step by step rather than jumping straight to the answer. Adding something as simple as “think through this step by step before giving your final answer” can improve accuracy noticeably on math, logic, and reasoning tasks. The model is better at catching its own mistakes when it shows its work.
Tree-of-thought prompting takes that idea further. Instead of one chain of reasoning, you ask the model to explore several possible approaches, evaluate them, and pick the best one. It is slower and uses more tokens, but for genuinely hard problems it often gives better results.
Here is a rough sketch of what that exploration looks like compared to a straight chain of thought (the prompt wording is illustrative, not a recipe):
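```python
# Chain of thought: one linear pass of reasoning.
chain_prompt = (
    "Should we cache this API response? "
    "Think through this step by step before giving your final answer."
)

# Tree of thought: branch into alternatives, evaluate, then commit.
tree_prompt = (
    "Should we cache this API response? "
    "Propose three different approaches, briefly work through the "
    "trade-offs of each, say where each could go wrong, and then "
    "recommend the strongest one."
)
```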
Where Prompt Engineering Stops Being Enough
Prompt engineering works really well for one-off tasks. You write a good prompt, get a good answer, move on.
The trouble starts when you try to build something repeatable or run an AI tool as part of a regular workflow. Every new conversation starts fresh. The model does not remember what you told it last time. If you have rules the model needs to follow, you have to include them again every single time. If your question depends on information that changes, like the current state of your codebase or recent log output, you have to paste that in yourself.
Prompt engineering also does not help with consistency across a team. If six developers are using the same AI tool, they are probably writing six different versions of the same prompts and getting six different qualities of output.
These are the problems that context engineering was built to solve.
Context Engineering
What It Is
Context engineering is about deciding what information the model gets, not just how you word your question.
The model can only work with what you give it. It does not know what is in your private codebase. It does not know what your team decided in last week’s meeting. It does not know what error appeared in your logs this morning. If any of that information is relevant to the task, you have to include it in the conversation yourself.
Context engineering is the practice of doing that thoughtfully and systematically.
I explored an early version of this back in my contextual chatbot article where I was working with Watson. The challenge then was keeping track of what the user had already said so the bot could respond sensibly instead of treating every message as a brand new question. The tools have changed a lot since then, but the core problem is the same: the model needs to know what happened before.
How Context Is Built
There are several ways to give a model the context it needs.
System prompts are instructions that sit above the conversation and stay in place for every single message. They are where you put things like “you are a code reviewer who specializes in Python” or “always respond in plain English, never use technical terms” or “never suggest deleting data without asking for confirmation first.” Think of them as standing orders.
Conversation history is the record of everything that has been said so far in the current session. When you include past messages in a new request, the model can refer back to earlier decisions and stay consistent. Without this, the model has no idea what was already discussed.
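Most chat-style APIs take both of these as a single list of messages. A minimal sketch of that shape (the content is invented, and the exact field names vary by vendor):

```python
# The system prompt sits above the conversation and applies to every
# message; the history travels along with each new request.
messages = [
    {"role": "system",
     "content": "You are a code reviewer who specializes in Python. "
                "Never suggest deleting data without asking first."},
    {"role": "user", "content": "Review this function for error handling."},
    {"role": "assistant", "content": "It swallows exceptions silently; log them instead."},
    # The new question arrives with everything above still in place,
    # so the model can stay consistent with earlier decisions.
    {"role": "user", "content": "Now apply the same review to the retry helper."},
]
```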
Retrieved documents are passages pulled from a knowledge base: you search for information relevant to the current question and paste it into the prompt. This is commonly called RAG, which stands for Retrieval Augmented Generation. The idea is that instead of the model guessing from general training knowledge, it reads actual current information you have pulled from somewhere specific.
Here is the overall flow, sketched in code (search and ask are placeholders for whatever retrieval system and model client you use):
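```python
def answer_with_rag(question: str) -> str:
    # 1. Search the knowledge base for passages relevant to the question.
    #    search() stands in for your vector store or search index.
    passages = search(question, top_k=3)

    # 2. Paste the retrieved text into the prompt next to the question.
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. The model reads actual current information instead of guessing
    #    from general training knowledge. ask() stands in for your client.
    return ask(prompt)
```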
Tool use is a way of letting the model reach out during a conversation to fetch information or take actions. Instead of you preparing all the context in advance, the model can ask for what it needs. It might call a weather API, query a database, or read a file. This is more dynamic and works well for tasks where you cannot predict exactly what information will be needed.
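The shape of that loop, again as a sketch with placeholder names (ask and run_tool stand in for your model client and tool dispatcher):

```python
# The model asks for what it needs mid-conversation instead of having
# every piece of context prepared in advance.
while True:
    reply = ask(messages)
    if reply.tool_call is None:
        break  # no more requests: the model has its final answer
    # Run whatever the model asked for: a weather API call, a database
    # query, a file read.
    result = run_tool(reply.tool_call.name, reply.tool_call.arguments)
    # Feed the result back so the model can keep working with it.
    messages.append({"role": "tool", "content": result})
```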
The Model Context Protocol
This is where MCP fits in. Anthropic released MCP as an open standard for connecting AI tools to external data sources and services. Before MCP, every AI tool had to build its own custom connectors for every data source. After MCP, you build the connector once and any tool that supports MCP can use it.
Think of it the way USB changed hardware. Before USB, every device had its own port. After USB, one standard port worked with almost everything. MCP is trying to do the same thing for AI data connections.
Where Context Engineering Stops Being Enough
Context engineering is a big step forward. Models with good context are more consistent, more accurate, and more useful than models relying only on a well-worded prompt.
But a model with good context is still just a model. It has no judgment about whether what it is about to do is risky. It does not hesitate when it is about to break something. It does not know when to stop. And as Louis Bouchard put it clearly in his March 2026 writeup on harness engineering: “A bigger context window does not magically turn a flaky agent into a reliable system.”
If you are using the model for a one-time task, context engineering is enough. But if the model is running a multi-step job, writing and running code, making tool calls, or operating across long sessions where it has to pick up where it left off, context engineering alone will not keep things from going wrong.
That is where you need a harness.
Harness Engineering
What It Is
A harness, in the physical world, is a set of straps and connections that keeps a person or object safe while they work in a risky environment. A rock climber wears a harness. A construction worker on a high scaffold wears a harness. The harness does not do the work. The person does the work. The harness makes sure that if something goes wrong, the consequences are limited.
Harness engineering applies the same idea to AI tools.
You build a set of rules, checks, and safety mechanisms around the AI model. The model still does the work. The harness makes sure that if the model does something wrong, the damage is caught early and does not spread.
Here is Mitchell Hashimoto’s framing of it, which I find particularly clear: when an AI makes a mistake, your job is to build something that ensures that specific mistake never happens again. Not just to fix the output this time. But to change the system so the model cannot make that mistake next time.
That is a different mindset than just re-prompting until you get a good answer.
The Five Layers of a Harness
A well-built harness has five layers. Each one adds a different kind of safety. You do not need all five from day one, but understanding them helps you decide which ones matter most for what you are building.
Layer 1: Limits
The first thing you define is what the AI tool is allowed to do and what it is absolutely not allowed to do.
This is not about instructions or guidelines. This is about hard structural limits. You are not asking the model to be careful. You are setting up the environment so that certain things are simply not possible.
Some common examples:
Sandboxed environments. You run the AI tool inside a container or VM where it can only see certain files and cannot touch anything outside its workspace. Even if the model produces code that tries to delete files outside its scope, the environment will not let it happen.
Read-only access. For certain sensitive parts of a system, you grant the AI tool read access but no write access. It can look but not touch.
Network rules. You can restrict what external addresses the tool can call. This prevents an AI agent from making unexpected API calls to places it should not be talking to.
Scoped file access. You tell the tool which directories it is allowed to work in. If you ask it to fix a bug in the payment service, it should not be touching the authentication service even if they are in the same repository.
The reason this layer comes first is that it does not depend on the model doing the right thing. It does not depend on good prompts or careful instructions. The limits are enforced by the environment itself. That makes them reliable in a way that instructions are not.
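To make that concrete, here is a minimal sketch of scoped file access enforced by code rather than by instructions (the directory and helper are invented for illustration):

```python
from pathlib import Path

# The only directory the AI tool may write into. Everything else is
# structurally off limits, no matter what the model decides to try.
ALLOWED_DIR = Path("/workspace/payment-service").resolve()

def guarded_write(path: str, content: str) -> None:
    target = Path(path).resolve()  # resolve any "../" tricks first
    if not target.is_relative_to(ALLOWED_DIR):
        raise PermissionError(f"Write outside allowed scope: {target}")
    target.write_text(content)
```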
Layer 2: Instructions
The second layer is where you tell the tool what it should do, what conventions to follow, and what mistakes to avoid.
This is different from a one-off prompt. These are standing instructions that apply to every task the tool works on. They live in a file that travels with your project, usually named something like AGENTS.md or CLAUDE.md depending on which tool you are using.
This file might include things like:
- The coding style your team uses
- Which libraries are approved and which are not
- How to handle errors (log them, do not swallow them silently)
- What to do when a task is unclear (ask rather than guess)
- Patterns that have caused problems in the past
- The naming conventions for files, functions, and variables
The key insight from Hashimoto is that you should update this file every time the AI makes a mistake. Not just fix the output. Update the file. That way the lesson is recorded and the next run of the tool starts with that lesson already built in.
Over time, this file becomes a log of everything your team has learned about working with AI on your specific project. It is institutional knowledge, written down in a form the model can actually read.
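Here is what a fragment of such a file might look like (the entries are invented, but the shape is typical):

```markdown
# AGENTS.md (excerpt)

- Use the logging module, never print(). Log errors, do not swallow them.
- requests is the approved HTTP library. Do not add alternatives.
- If a task is ambiguous, stop and ask instead of guessing.
- Lesson learned: do not "fix" flaky tests by adding sleeps.
  Find the race condition instead.
```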
This is very similar to what I wrote about in the Terraform and Pulumi article. When I hit the DNS resolution problem with Secrets Manager, the fix was not just changing the code. It was understanding why the old pattern was wrong and writing that understanding down so it would not get repeated. Your AGENTS.md is that write-up, but for your AI tool.
Layer 3: Checks
The third layer is where you verify that what the AI produced actually works and meets your standards.
This is the most familiar layer for most developers because it is what CI/CD pipelines have always done. The difference is that now you need to make these checks into hard gates that the AI tool cannot skip.
The checks you care about here are:
Linting. Does the code follow the formatting and style rules? Linters catch this automatically and give the same answer every single time. You do not need a human to eyeball indentation.
Type checking. If you are working in a typed language, does the code satisfy the type checker? This catches whole categories of bugs before the code ever runs.
Unit tests. Do the existing tests still pass? Did the AI break anything that was working before? These need to be run every time.
Integration tests. Does the changed code work correctly with the rest of the system? This is harder to automate but catches problems that unit tests miss.
Security scans. Did the AI introduce any obvious security problems? There are automated tools for this. They are not perfect but they catch common mistakes.
The important word in all of this is “gate.” The AI tool cannot move forward until these checks pass. It is not a suggestion. It is a requirement.
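A minimal sketch of that gate in Python, assuming ruff for linting, mypy for type checking, and pytest for tests:

```python
import subprocess
import sys

# Each check is a hard gate: the run stops at the first failure, and
# the AI's change does not move forward until every gate passes.
CHECKS = [
    ["ruff", "check", "."],  # linting
    ["mypy", "."],           # type checking
    ["pytest", "-q"],        # unit tests
]

for cmd in CHECKS:
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"Gate failed: {' '.join(cmd)}", file=sys.stderr)
        sys.exit(result.returncode)
```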
I wrote about the value of self-monitoring systems in the process monitoring article. A system that watches itself and reacts to problems is far more reliable than one that relies on a human to notice something is wrong. The verification layer is exactly that for AI-generated output.
One thing worth noting: these checks are deterministic. The same code going in will get the same result every time. It does not matter which model you used, how you phrased the prompt, or which day of the week it is. That consistency is what makes this layer so valuable.
Layer 4: Recovery
The fourth layer answers a question that sounds obvious but often gets ignored: what happens when a check fails?
Without a recovery layer, you have two bad options. The AI tool loops forever trying to fix a problem it cannot solve. Or someone has to manually intervene every time something goes wrong. Neither of those is acceptable when you are trying to build something reliable.
A recovery layer gives you structured ways to handle failures.
Retry with a count limit. If a check fails, send the failure message back to the AI and let it try again. But set a limit. If it fails three times in a row, stop and escalate. An AI that keeps trying and keeps failing is wasting time and potentially making things worse.
Rollback. If the AI has made a series of changes and the system is now in a worse state than before, you need a way to undo everything back to the last known good point. Version control makes this possible if you have set things up correctly.
Smaller scope. Sometimes a task is too big and the AI gets confused partway through. Breaking the task into smaller pieces and retrying each piece separately often works better than retrying the whole thing.
Human escalation. Some failures are beyond what an automated system can handle. You need a clear path for these situations to land in front of a human who can figure out what to do next.
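Put together, retry-then-escalate is only a few lines (a sketch; run_task, run_checks, and the checkpoint helpers stand in for your own pipeline):

```python
MAX_ATTEMPTS = 3

def attempt_with_recovery(task):
    checkpoint = save_checkpoint()       # e.g. a git commit to return to
    failures = []
    for _ in range(MAX_ATTEMPTS):
        output = run_task(task)          # the AI does the work
        failures = run_checks(output)    # the Layer 3 gates
        if not failures:
            return output                # checks passed: done
        task = add_feedback(task, failures)  # tell the AI what failed
    rollback(checkpoint)                 # undo to the last known good state
    escalate_to_human(task, failures)    # this one needs a person
```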
The recovery layer is what separates a demo from a production system. A demo can afford to have a human watching every run. A production system needs to handle failures on its own for the routine cases.
Layer 5: Review
The fifth layer is a review step, done either by a separate AI tool or by a human.
The key word is separate. The AI that did the work should not be the same one that reviews it. There is a well-documented tendency for models to approve their own output even when they can see problems with it. They are optimistic about themselves.
A separate reviewer, given a prompt that says “your only job is to find problems with this output,” will catch things the original tool missed. It has no attachment to the decisions that were made. It is looking with fresh eyes.
Some teams set up a three-step flow: one tool plans the approach, another tool does the work, a third tool reviews what was done. Each tool has a different role and a different set of instructions.
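A sketch of that separation, with ask(instructions, content) again standing in for a model call:

```python
# Three separate calls, three separate sets of standing instructions.
plan = ask("You are a planner. Break this task into concrete steps.", task)
work = ask("You are an implementer. Carry out this plan exactly.", plan)

# The reviewer has no attachment to the decisions that were made. Its
# instructions point it at finding problems, not at approving.
review = ask(
    "Your only job is to find problems with this output. "
    "List every issue you see. Do not approve or soften.",
    work,
)
```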
An interesting thing to know about review tools: Anthropic’s own engineering team found that in early versions of their review setup, the evaluator would tend to approve output even when it had found problems with it. It took extra work to tune the reviewer to actually hold firm on issues it detected. Even the review layer needs careful setup.
To make review output useful rather than just noise, it helps to sort findings by how serious they are, so a genuine blocker does not get buried under a pile of nitpicks.
For security-sensitive systems, the human review step at the end is not optional. It is the final gate before anything goes live. The AI layers before it narrow down the things a human needs to look at, but a human still makes the final call.
This connects to something I wrote about in the chain of trust article. In SSL, trust is not a single yes or no decision. It is a chain where each link has to be verified before you trust the next one. A harness works the same way. Each layer verifies something specific. The final output is only trustworthy because every link in the chain held.
How the Three Approaches Fit Together
Prompt engineering, context engineering, and harness engineering are not competing ideas. They are three layers of the same thing. You need all three if you want to build AI tools that work reliably.
Here is a simple way to figure out where you currently are and what to focus on next:
| What you are doing today | Where you are | What to add next |
| --- | --- | --- |
| Writing prompts for one-off tasks | Prompt engineering | Add system prompts and conversation history |
| Using system prompts and pulling in documents | Context engineering | Add linting and test gates |
| Running tests against AI output | Early harness | Add retry logic and a review step |
| Full retry, review, and human escalation in place | Mature harness | Keep tuning the review criteria |
How to Start Without Overbuilding
I am a practical developer. I do not like adding complexity unless it solves a real problem.
So here is how I would approach building a harness based on how much is at stake.
If you are exploring or building something personal: Start with Layer 2 only. Write a good AGENTS.md for your project. Update it when the AI does something wrong. That alone will improve consistency a lot. You do not need sandboxing or review pipelines for a side project.
If you are adding AI to a team workflow: Add Layer 3. Wire your existing linting and tests as gates the AI cannot skip. This gives you confidence that the AI is not quietly breaking things. Make sure the same checks that run on human-written code also run on AI-written code.
If the AI is taking actions that affect data or services: Add Layer 1 and Layer 4. Lock down what the tool can touch. Set up rollback. Define what happens when things go wrong before they go wrong.
If the AI is working on security-sensitive or customer-facing systems: Add Layer 5. Build in a review step. At minimum, have a human review anything that goes to production. The other layers narrow down what needs human attention. The human review makes the final call.
A Note on Trust
One thing that has stuck with me from working with both traditional software systems and AI tools is that trust is something you build over time, layer by layer.
When you deploy a new microservice, you do not trust it completely from day one. You monitor it. You add alerts. You watch how it behaves under real load. Over time, as it proves itself, you rely on it more.
AI tools are the same. The harness is how you build that trust systematically rather than just hoping for the best.
Each layer of the harness is a checkpoint. Each checkpoint you add is one more reason you can say “I trust this output because I know it passed these specific tests, it was reviewed by a second tool, and it had hard limits on what it could even attempt to do.”
That is a much more solid foundation than “I read the output and it looked okay to me.”
Conclusion
The progression from prompts to context to harnesses is the natural path of any new technology moving from novelty to production use.
Early on, the excitement is about what the technology can do at all. You discover that carefully written prompts get better answers than careless ones. Amazing. You share what works.
Then you start using the technology for real work and you realize that getting a good answer to one question is not the same as building something reliable. You start thinking about context, about consistency, about what the model needs to know to do its job well.
Then you put it into production and you realize that reliability is not just about good inputs. It is about what happens when things go wrong. It is about making sure the mistakes are caught, the failures are handled, and the output can be trusted.
That is harness engineering. It is not glamorous. It is not the part that goes in the demo. But it is the part that makes the difference between a tool that works sometimes and a system you can actually depend on.