The Free Agent Trap
Author(s): Ramon Invarato
Originally published on Towards AI.

Why agents that run on their own are still failing

A standalone article from the series “AI and You”.

You’ve been promised the same thing as everyone else: that artificial intelligence no longer just answers questions, it does the work on its own. You assign a task, step out for a coffee, and come back when it’s done. That’s the promise of AI “agents,” talked about everywhere. It sounds great. The problem is what happens when nobody’s watching while the AI does the job.

An example a developer shared in the comments of a technical blog (article in Spanish) illustrates it perfectly. They asked an agentic AI to display a customer’s orders on a web page. The AI solved it in a way that worked… and was a disaster: for each order, it fired an independent query to the database. A hundred thousand orders, a hundred thousand queries. The code compiled, the tests passed, and the screen showed the correct data. There was just one detail: the page took twenty seconds to load and saturated the database, whereas a professional would have solved it with a single query in half a second. It worked up close; it was useless from a distance.

That gap between “it works” and “it’s useful” sums up one of the great disappointments of this year: the promise of the autonomous agent, capable of executing long tasks on its own without human supervision, remains unfulfilled. Models are good at individual pieces of work, but when left to run for fifteen or twenty consecutive steps, they fail in ways a human wouldn’t easily catch, and that can be very costly.

This article explains why, covers the data backing it up, and discusses when it makes sense, and when it doesn’t, to use a free agent. This is not an article against AI: it’s a guide for using it, with ideas for minimizing real risk.

About the extreme cases in this article.
Some comparisons, scenarios, and diagrams in this text are illustrative: they contrast extremes (utopia / dystopia) to make a range visible. They are not operational recommendations or predictions. The author takes no responsibility for how each reader uses these ideas.

Conceptual representation of silent corruption: the agent iterates turn by turn, the document degrades from within, and the surface keeps looking impeccable. A representative log of how this happens in code appears in “Anatomy of a silent disaster: the internal log of an agent.”

The promise being sold

The narrative of recent years has been clear: the next frontier of generative AI is not answering questions but executing tasks. An agent receives a goal (“book me a flight to London on Friday”, “refactor this code”, “put together a quarterly report with the CRM data”), breaks the problem into steps, executes those steps, checks whether the result is approaching the goal and, if not, retries. Ideally, you come back when it’s done.

That promise is what moved tens of billions of dollars in investment in 2024–2026. It’s also what justified Uber deploying agentic tools to its 5,000 engineers and burning through its entire annual AI budget in four months (article in Spanish). Measured with reproducible benchmarks, however, the actual capability of agents lags far behind the marketing being sold to us.

This contradiction reaches into the heart of the very companies created to lead the AI revolution. The Anthropic Institute published “When AI builds itself” in June 2026, revealing that in May 2026 more than 80% of merged code in Anthropic’s internal repository was written by Claude. Before Claude Code launched in preview (February 2025), that figure was in the low single digits.
In Q2 2026, the typical engineer was merging 8× more code per day than in 2024; the report itself attributes that second jump to the moment models started working autonomously over longer time horizons, with the engineer in the role of director and reviewer rather than typist. Yet, in a paradoxical twist, security teams and founders at that same company have led public warnings calling for caution and strict regulation in deploying advanced autonomy without guardrails. They know better than anyone that the speed of commercial adoption is running well ahead of the theoretical safety net of the models.

The figure that refuses to move: CRMArena-Pro

In June 2025, Salesforce AI Research published CRMArena-Pro, the first serious benchmark for evaluating AI agents in realistic enterprise environments. The difference from previous benchmarks matters: CRMArena-Pro doesn’t measure whether the AI can answer an isolated question well. It measures what happens when it’s asked to execute a complete CRM task across multiple turns, with realistic data (over 83,000 synthetic but structurally representative records), simulating a conversation with a human user who keeps asking for things.

The results, after evaluating the top models of the moment (Gemini 2.5 Pro and similar), were:

- Single-turn (a single interaction): success rate of 58%.
- Multi-turn (several chained interactions, which is what a realistic workflow looks like): success rate of 35%.
- Workflow Execution (structured tasks with clear steps): up to 83% in single-turn, which shows that when the structure is clear, the agent works reasonably well.
- Confidentiality awareness: practically zero without specific prompting. When the agent is asked to handle confidentiality, awareness improves, but the task success rate drops further.

Translated into plain terms: a top-performing agent, left to its own devices in a realistic multi-turn workflow, fails roughly two out of every three times. The cause is not a one-off bug; it is structural.
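The “two out of every three” reading follows directly from the 35% multi-turn figure, and a few lines of arithmetic make it concrete. The sketch below is illustrative only: the two benchmark rates are the ones quoted above, while the 95% per-step reliability and the idea that steps fail independently are assumptions introduced here for illustration, not something CRMArena-Pro measures.

```python
# Success rates quoted above from CRMArena-Pro, as fractions.
SINGLE_TURN_SUCCESS = 0.58
MULTI_TURN_SUCCESS = 0.35

# HYPOTHETICAL per-step reliability, chosen only for illustration.
ASSUMED_STEP_SUCCESS = 0.95


def failure_rate(success_rate: float) -> float:
    """Share of tasks that fail, given the share that succeed."""
    return 1.0 - success_rate


def chained_success(per_step_success: float, steps: int) -> float:
    """Chance that a whole chain of steps succeeds.

    ASSUMPTION (not from the benchmark): steps fail independently
    and a single failed step sinks the whole task.
    """
    return per_step_success ** steps


if __name__ == "__main__":
    print(f"Single-turn failure rate: {failure_rate(SINGLE_TURN_SUCCESS):.0%}")
    print(f"Multi-turn failure rate:  {failure_rate(MULTI_TURN_SUCCESS):.0%}")

    # How a seemingly reliable step degrades over a long unattended run.
    for steps in (1, 5, 15, 20):
        rate = chained_success(ASSUMED_STEP_SUCCESS, steps)
        print(f"{steps:>2} steps at {ASSUMED_STEP_SUCCESS:.0%} each: {rate:.0%}")
```

Under that toy model, a step that looks dependable on its own stops being dependable once fifteen or twenty of them are chained without review. Real agents do not fail independently, so treat the loop as intuition rather than a prediction.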
When confidentiality oversight is added, the success rate drops further still.

DELEGATE-52: silent corruption

So far, we’ve measured how often the agent gets it right. The next study measures something different and more unsettling: how much the agent damages what it touches when it works alone for an extended stretch. As if CRMArena-Pro weren’t enough, in April 2026 Microsoft Research published a study that attacks the problem from this angle. DELEGATE-52 measures what […]