Karpathy Said Kill Prompt Engineering. Here’s What He Actually Meant.

digitado ⋅ 23 de May de 2026

The internet declared a funeral. The body wasn’t dead.

Last June, Andrej Karpathy — former Tesla AI director, OpenAI co-founder, and one of the few people in the field whose offhand posts routinely restructure how thousands of developers work — posted something that landed like a scheduled demolition.

“+1 for ‘context engineering’ over ‘prompt engineering,’” he wrote on X. “People associate prompts with short task descriptions you’d give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step.”

Shopify CEO Tobi Lütke amplified it within hours. Simon Willison, one of the most careful voices in the AI developer community, wrote that “prompt engineering” had been “redefined to mean typing prompts full of stupid hacks into a chatbot” and that the serious discipline it once pointed to had been buried under two years of Twitter jailbreak threads. By July, Gartner had published: “Context engineering is in, and prompt engineering is out.”

The response was immediate and tribal.

Team Context Engineering said prompt engineering was always a parlor trick. They pointed to real production failures: elaborate system prompts degrading model performance, persona instructions confusing instruction-following, chain-of-thought templates that worked in demos and broke in production. “Prompt engineering is cute for demos,” went the line. “But real-world AI needs structure. Memory. History. Tools. Data.”

Team Prompt Engineering fired back. This is just rebranding. System prompts still exist. Few-shot examples still work. You’re still writing instructions to a model — you’ve just renamed the activity to sound more serious. “Context engineering is just prompt engineering with a PR budget.”

Both camps misread what Karpathy actually said.

What Karpathy Actually Said

Re-read the post, and the argument is narrower and more specific than either side treated it.

Karpathy was not saying that writing instructions for AI is dead. He was saying that the term “prompt engineering” has been captured by a conception of the work that doesn’t match what serious production AI applications actually require.

The specific failure he identified: people treat a “prompt” as the thing you type before the model responds. But in any real AI application — an agent that codes, a chatbot that handles customer queries, a pipeline that processes documents — the thing the model sees isn’t just a prompt. It’s a context window, assembled at runtime from multiple sources: system instructions, retrieved documents, conversation history, tool outputs, memory, user metadata, and intermediate reasoning steps.

Managing that assembly — deciding what goes in, what stays out, in what order, at what granularity — is the actual engineering problem. It is not trivially related to the question of how to phrase a task description. Calling it “prompt engineering” suggests the bottleneck is your word choice. It isn’t. The bottleneck is architecture.

As Karpathy put it in a follow-up comment: “Too little or of the wrong form and the LLM doesn’t have the right context for optimal performance. Too much or too irrelevant, and the LLM costs might go up, and performance might come down. Doing this well is highly non-trivial.”

He wasn’t declaring a funeral. He was pointing at a mislabeled door.

Where Context Engineering Is the Right Frame

The cases where context engineering is the meaningful upgrade from traditional prompt engineering all share one property: the model’s performance is bottlenecked by what it has access to, not by how elegantly you asked.

Agentic workflows. When an agent executes multiple steps — planning, tool use, code execution, result synthesis — each step requires a freshly assembled context. What the agent needs at step 4 is not what it needed at step 1. A static system prompt cannot handle this. You need to manage what gets carried forward, what gets compressed, what gets retrieved from memory, and what gets dropped.

A naive agent prompt looks like this:

You are a helpful software engineer assistant. You have access to the following tools:
run_code, search_web, read_file, write_file. Complete the user's task step by step.
Be careful, thorough, and accurate. Always explain your reasoning.

A context-engineered agent populates the window differently at each step:

[SYSTEM]: You are a code review agent. Focus only on the current task.
[RETRIEVED]: PR diff for auth.py — 127 lines modified
[TOOL RESULT]: Static analysis output — 3 warnings, 1 error
[MEMORY]: Prior session: user prefers inline comments over summary reports
[TASK]: Review the authentication changes for security issues

The difference isn’t word choice. It’s what’s in the room when the model starts working.

RAG applications. Retrieval-augmented generation lives or dies on context assembly. You can have the most eloquent system prompt in the world, but if you’re retrieving the wrong chunks — too many, too few, wrong granularity, wrong ordering — the model can’t compensate. A 2025 study from Stanford’s Center for Research on Foundation Models found that retrieval quality accounted for more output variance than model selection across standard QA benchmarks.

Long-running pipelines. When models are chained — output of step N becomes input of step N+1 — context pollution accumulates. Intermediate reasoning that was useful in one step becomes noise in the next if it’s not managed. The engineering problem is deciding what to compress, summarize, or discard between steps. There is no prompt trick that substitutes for this.

Where Prompt Engineering Is Still the Real Work

Karpathy’s reframe does not make your system prompt irrelevant. Here is where careful instruction design still dominates outcomes:

Single-turn structured tasks. When you’re asking a model to convert a document, classify an input, extract structured data from text, or generate something to a precise format, context assembly is trivial — you’re just giving it the document. The variable is the instruction: how you specify the output schema, what examples you provide, what constraints you articulate.

A mediocre extraction prompt:

Extract the key information from this contract.

A well-engineered one:

You are a contract analyst. Extract the following fields from the contract below.
Return a JSON object with exactly these keys: party_a, party_b, effective_date,
termination_clause, governing_law. If a field is absent, return null.
Do not infer or guess values not explicitly stated.

Contract:
{{contract_text}}

The context window is not the bottleneck here. The instruction is.

Few-shot examples. When a model is struggling with a novel task or an unusual output format, providing two or three worked examples in the prompt remains one of the highest-leverage interventions available. This is prompt engineering in its purest form, and it still works.

Model behavioral calibration. Specifying tone, persona, level of detail, and refusal behavior in a system prompt is not obsolete. For deployed products, these instructions are the difference between a model that behaves consistently and one that surprises users. The work of getting these right is careful and iterative — and it is prompt engineering.

The Decision Framework

The question is not “prompt engineering or context engineering.” The question is: what is the actual bottleneck in my system?

Most real systems have both problems at different layers. The error is treating one as a substitute for the other.

Try This First

If you’re building with an agent or multi-step pipeline and hitting quality issues, run this diagnostic before changing your prompt:

# Log what the model actually sees at each step
import json

def log_context(step_name, messages):
    total_tokens = sum(len(m['content'].split()) for m in messages)
    print(f"n=== {step_name} ===")
    print(f"Total messages: {len(messages)}")
    print(f"Estimated tokens: {total_tokens}")
    print(f"Roles: {[m['role'] for m in messages]}")
    print(f"Last message preview: {messages[-1]['content'][:200]}...")

# Use this to see what you're actually passing before debugging the prompt

More often than not, you’ll find the problem isn’t your instruction phrasing. It’s what got into the window — or what didn’t.

The Actual Point

Karpathy wasn’t declaring that writing instructions to AI is a dead skill. He was pointing out that the industry had been optimizing for the wrong variable — polishing the prompt while ignoring the context it sits in.

The tribal read (“prompt engineering is dead / this is just rebranding”) missed the diagnostic. In simple, single-turn applications, prompt engineering remains the primary lever. In complex, multi-step, production systems, the context assembly surrounding the prompt is the primary lever, and most teams are not thinking carefully about it.

Both skills are real. Both are necessary. What changed is that as applications get more complex, the ratio shifts. The prompt is no longer enough to carry the system. The window it lives in has to be engineered too.

That’s not a funeral. It’s a scope expansion.

Are you hitting walls with prompt quality, or with context assembly? The answer probably tells you where to focus next.

Karpathy Said Kill Prompt Engineering. Here’s What He Actually Meant. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Like 0

Liked Liked