Insights into Claude Opus 4.5 from Pokémon

Published on December 9, 2025 4:57 PM GMT

Credit: Nano Banana, with some text provided.

You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn’t beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet![1]

This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness[2] and the relatively hands-off approach of its creator, David Hershey of Anthropic.[3] When Claudes repeatedly hit brick walls in the form of the Team Rocket Hideout and Erika’s Gym for months on end, nothing substantial was done to give Claude a leg up.

But Claude Opus 4.5 has finally broken through those walls, in a way that perhaps validates the chatter that Opus 4.5 is a substantial advancement.

It is hardly AGI-heralding, though, as will become clear. What follows are notes on how Claude has improved (or failed to improve) in Opus 4.5, written by a friend of mine who has watched quite a lot of ClaudePlaysPokemon over the past year.[4]

Improvements

Much Better Vision, Somewhat Better Seeing

Earlier this year, LLMs were effectively close to blind when playing Pokémon, with no consistent ability to recognize and distinguish doors, buildings, trees, NPCs, or obstacles.

For example, this screen:

Choosing your starter Pokémon in Professor Oak’s lab.

…at the time of Sonnet 3.7 confounded every LLM I tested it on, all of which had difficulty consistently identifying where the Poké Balls were or figuring out which Pokémon they wanted, sometimes even accepting the wrong starter by accident. Opus 4.5 made this look like the trivial problem that it is.[5]

In general, Opus 4.5 no longer has any trouble finding doors, and recognizes key buildings like gyms, Pokémon Centers, and marts the moment they appear on-screen. He also shows a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never “the red-hatted NPC”, and he can pick out gym leader Erika from a lineup.

Erika is second from left. A previous Claude, the only other Claude to ever reach this gym, failed to recognize that the gym leader was here and kept insisting it had beaten every trainer. Eventually it left and never came back.

The new vision is hardly perfect, though. It degrades when Claude isn’t paying attention, and when Claude isn’t willing to believe his own lying eyes.

Attention is All You Need

On the first point, Claude very frequently seems to simply ignore things in his field of vision if he’s not “looking” there. Even worse, in key moments when he’s close to his current goal, he seems to rely on his vision less, and even ignore it entirely sometimes.

Claude in the infamous Team Rocket Hideout Hell.

The above represents the ur-example of Claude “blindness”. Those two left-pointing arrows (“spinners”) to his left represent the only potential path to progress, but he knows his goal is to the right and even thinks he sees it. Claude visited this exact spot dozens of times, and on fewer than five of those visits did he seem to realize there were spinners to his left. Also, he clearly had trouble distinguishing the green boxes from the spinners and routinely tried stepping onto the boxes–a mistake that only materialized when he was close to the goal. He had no apparent problem telling the difference much of the rest of the time.

Here’s another example that will be traumatic to the Twitch viewers:

Claude in Celadon City, trying to find the gym.

Here the tree which must be CUT to progress to the gym is clearly in view, but Claude is focused on looking for an open pathway and shows no sign of seeing it, walking right by–yet just minutes later he will spot it on the way back, having given up looking for an open pathway.[6]

The Object of His Desire

On the second point, Claude is noticeably more prone to hallucinating or misidentifying objects as what he’s looking for if he really wants it to be there.

The classic example here is Claude’s search for the elevator in the Team Rocket Hideout: 

Claude in another part of the Team Rocket Hideout, not anywhere near the elevator he’s looking for.
Claude’s reasoning about the above screen.

It’s been hours and Claude has grown a bit desperate. This is also not the only time he hallucinates the elevator. For example, in the exact same spot we discussed earlier, which is actually quite near the elevator:

Claude still in the infamous Team Rocket Hideout Hell.
Claude’s reasoning about the above screen, this time.

Now, the elevator is actually in that direction, and Claude even saw it (for real) earlier. But he’s become so fixated on it that he mistakes the gray wall for the elevator despite really knowing better.

And before you judge Claude too harshly, the elevator he’s searching for looks like this:

The dark pink/red carpet at the bottom leads to the elevator. The elevator itself has its own separate screen which is more obviously an elevator, but there’s no clear “elevator door” sprite as an entrance, just that carpet, which you have to remember leads to the elevator scene.
Nevertheless Claude can identify the carpet as the entrance to the elevator.

A Note

Let me be clear: I’m using the language of intentionality, as if Claude is choosing to ignore things. I don’t think that’s the case. I think his attention mechanisms actively screen out what they think is irrelevant, rendering the parts of the model trying to make decisions effectively blind to it.

Humans have built-in attention mechanisms, but they are clearly better built than this, even if they do have similar failure modes in extremis.

Mildly Better Spatial Awareness

I don’t want to oversell this one. Claude’s understanding of how to navigate a 2D world is clearly still below that of most children, but there are improvements:

  1. When trying to reach a door in front of a building and finding his path blocked from a particular direction, Claude will now try to walk around the other way.
  2. Claude can now maintain an awareness (via notes) of where parts of a building or city are relative to each other and perform simple navigation tasks.
    1. Previous versions constantly lost track of what wasn’t immediately in view.
  3. Claude can now perform some basic in-out geometric reasoning: leaving a building from the top of the room is likely to push me out the top of the building, elevators on different floors are probably in the same location on the floor, etc.

Better Use of Context Window and Note-keeping to Simulate Memory

Another obvious improvement to Claude’s capability is improved note-keeping and memory of context. Previous versions of Claude such as Sonnet 3.7 showed little sign that they “recalled” anything more recent than a few messages ago, despite having much of it in context. And while they were diligent notetakers, they only rarely seemed to read their own notes–and when they did, it was clearly in a stochastic manner, to the point that chat liked to speculate about whether Claude would read his notes this time and what part of his notes he would read.

Opus 4.5 is much, much better at both monitoring context and using notes. He often manages to maintain a passable illusion of actually “remembering” the past 15 minutes or so, referencing recent events, evading past hurdles, and just generally maintaining a much more coherent narrative of what’s going on.

For longer-term memory, Claude must blatantly rely on whatever he happens to have written down in his notes, and he does a much better job of writing and reading his own instructions, routinely repeating past navigation tasks successfully and competently. Nowadays, if Claude does something and writes down how to do it, he can do it again.

It is difficult to overstate how much this contributes to a smoother, faster game flow. Claude can maintain navigational focus for extended periods, explore simple areas competently, and as long as the notes are good and his assumptions are sound, things flow smoothly.

Of course, sometimes things go haywire…

Self-Correction; Breaks Out of Loops Faster

…but more than before, things get fixed quickly. This is difficult to quantify, and I believe it stems heavily from a much better ability to notice when events are repeating within his context window. Claude more frequently and consistently notices when he’s trying something that clearly isn’t working and will try to vary it up. Coupled with his improved spatial reasoning, navigation tasks that took previous iterations days or weeks of trial and error have almost breezed by: Viridian Forest and Mt. Moon were relatively simple affairs, only a few loops around Vermilion City were necessary before he pathed right to the dock, etc.

It’s not all smiles and roses: Claude is still much slower than a human would be, and not every puzzle gets solved breezily. Nor has Claude deduced key facts like “walking in front of a trainer triggers a fight”, instead treating these as effectively random encounters.

Still, it’s something.

Not Improvements

Claude Would Still Never Be Mistaken for a Human Playing the Game

I’d like to tell a quick story to give readers the flavor of what it’s like to watch Claude sometimes, even when he’s technically accomplishing his goals with aplomb. This is the story of Claude attempting to acquire the Rocket HQ Lift Key, technically the first thing he did that no previous model had ever accomplished.

  1. Claude arrives in Team Rocket HQ and immediately declares the staircase next to him to be the elevator to Giovanni, for which he needs the Lift Key.
  2. Claude then ignores the false “elevator” for hours, confident that he will be unable to “use” it, wandering the entire rest of the floor looking for the lift key or another set of stairs.
  3. Finally he chooses to try the original “elevator”, finding to his surprise that it works. He writes down in his notes that the elevator doesn’t need the Lift Key.
  4. Claude makes his way down two floors, encountering the infamous B3F maze. After getting stuck, he uses his only escape rope, then comes right back and, much to everyone’s surprise, solves the maze in one try, writing down the solution.
  5. On B4F after the maze, he clears out the area, but to his puzzlement fails to find Giovanni. He battles the Team Rocket Grunt carrying the Lift Key but doesn’t talk to them again, so they don’t give him the Lift Key. (To be fair on this point, this is a very confusing nuance in the game that has trapped many kids too, and Game Freak changed this in Pokémon Yellow.)
  6. He concludes that he is mistaken, and needs to go back to the “elevator” he saw earlier on B1F and use that.
  7. After circling the elevator in frustration trying to “use” it, he concludes he’s missing a Lift Key, goes back to B3F, solves the maze trivially using his notes, and acquires the Lift Key.
  8. He then returns to the “elevator”. He circles the elevator for ~50 minutes, before finally concluding it’s not the real elevator but rather an “elevator/stairs” that mysteriously only connects two floors. Eventually, he amends this to “escalator”, which seems to resolve the cognitive dissonance and he happily refers to it as the escalator for the rest of the time he’s in Team Rocket Hideout.

Claude Still Gets Pretty Stuck

Early on, before the Team Rocket Hideout, watchers of ClaudePlaysPokemon legitimately wondered if Anthropic had solved all of Claude’s main issues with the game, and perhaps everything would be smooth sailing from here on out. He had overcome some of earlier models’ biggest timesinks (Mt. Moon, Viridian Forest, finding the pathway from Cerulean City to Vermilion City, finding the Captain of the S.S. Anne) without difficulty.

But, critically, Claude had yet to hit the roadblocks that had permanently stopped previous models from progressing.

When he reached Erika’s Gym (the one with the CUT-able tree I mentioned earlier), Claude spent ~4 days, or about 8000 reasoning steps, walking in a plain circle around the top of the gym looking for a path through.

What was he doing? Well, mostly trying to path through impassable walls and, knowing that CUT is involved in getting into the gym somehow, trying to cut through the gym’s roof.

Source: user reasonosaur on /r/ClaudePlaysPokemon

If there’s one thing Claude does have, it’s inhuman patience,[7] but even he eventually gave up, choosing to do Team Rocket Hideout first, which the game does allow you to do.

Over 13,000 reasoning steps later,[8] having completed the Team Rocket Hideout and other tasks, Claude returned and almost immediately found the proper CUT-able tree and finally progressed.[9]

Sometimes you just need to clear your head.

Claude Really Needs His Notes 

I think the anecdotes above mostly speak for themselves in illustrating the problems that bad vision, cognitive bias, and inconsistent memory still give Claude Opus 4.5. But I would highlight how utterly dependent Claude is on the quality of his notes: one incorrect assumption or hallucination embedded into a note can crater progress for days, while a well-written note can enable human-like performance.

I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.

Poor Long-term Planning

It is possible to detect other reasoning issues or inhuman thinking in Claude’s behavior, though these are not as crippling as the others.

Claude is incredibly short-term-goal-obsessed, and seems to have no interest in ever trying to do two things at once, even in the service of the greater goal. There also seems to be little reflection about the long-term consequences of an action, even in trivial ways.

Things that Claude has done that would be alien to human players:

  1. Spamming a valuable move with limited PP when there are clearly going to be many trainers ahead, without considering whether another move might be appropriate for the current fight (Ember to kill a grass type, for instance, to save Slash PP).
  2. When out of space in the inventory, Claude routinely trashes valuable items even when he could just use some of the items. Sometimes he trashes an item that could be used on the spot (e.g., a stat-boosting vitamin that could be fed to Charizard).
  3. Leaving Charizard out against a water type that could easily be handled by the grass type on the bench, wasting PP. In fact, he just loves using only Charizard.
    1. That this is an infamously child-like strategy says something about Claude’s cognitive development… or not, as Red is simple enough that just using Charizard is a mainline speedrun strategy. Though, Claude has never claimed to be following any such strategy.
  4. Not picking up a Rare Candy that is blocking his path in Pokémon Tower for over an hour, because he was too focused on finding the path.
    1. In general Claude is strangely reluctant to pick up items.

Don’t Forget

Just recently, GPT-5.1 completed a run of Pokémon Crystal using a fairly minimal harness in 9,454 reasoning steps across 108 realtime hours. For comparison, the original Gemini 2.5 Pro Pokémon Blue run took 106,505 reasoning steps across 813 realtime hours, and Claude Opus 4.5 is already at 48,854 reasoning steps over 300+ hours. GPT-5.1’s 108 hours for Crystal is only ~3x as slow as a human player! Give a frontier LLM a solid minimap and some good prompts[10] and it’s not half bad at Pokémon these days.
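As a rough sanity check on those figures, here is a small Python sketch that turns the quoted step counts and hours into steps per hour, and backs out the human playtime implied by the "~3x" claim. The function names are my own, the Claude run is still in progress, and "300+ hours" is treated as exactly 300, so its rate is an upper bound rather than a measurement.

```python
# Figures quoted in the paragraph above: (reasoning steps, real-time hours).
runs = {
    "GPT-5.1 (Crystal)": (9_454, 108),
    "Gemini 2.5 Pro (Blue)": (106_505, 813),
    # Assumption: "300+ hours" is taken as 300, so this rate is an upper bound.
    "Claude Opus 4.5 (Red, in progress)": (48_854, 300),
}


def steps_per_hour(steps: int, hours: float) -> float:
    """Average number of reasoning steps taken per real-time hour."""
    return steps / hours


def implied_human_hours(llm_hours: float, slowdown: float) -> float:
    """Human playtime implied by 'the LLM is `slowdown` times as slow'."""
    return llm_hours / slowdown


if __name__ == "__main__":
    for name, (steps, hours) in runs.items():
        print(f"{name}: {steps_per_hour(steps, hours):.1f} steps/hour")
    # "~3x as slow as a human" implies a human Crystal run of about this length.
    print(f"Implied human Crystal run: {implied_human_hours(108, 3):.0f} hours")
```

On these numbers GPT-5.1 takes the fewest steps per hour of the three, so its speed comes from needing roughly a tenth as many steps as the Gemini run, not from stepping faster.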

Claude’s consistently minimal harness tells us something about progress in LLM cognition, but we shouldn’t forget that the past year’s improvements in efficient Pokémon agent harnessing tell us something too: raw intelligence is not the only lever pushing LLM performance forward. In fact, it’s not necessarily even the most effective one right now.

  1. ^

    That’s failures to improve by Claude Sonnet 4, Claude Opus 4, Claude Opus 4.1, and Claude Sonnet 4.5, at least in terms of story progression. They have gotten faster at reaching the same story point at which they get stuck.

  2. ^

    There have been a few changes: support for Surf (now that Claude can get that far), removal of a bunch of tailored prompts, and a change where spinner tiles in mazes are labeled like obstructions, as well as a related change to wait for the player character to stop spinning before the screenshot of the current game state is taken. The latter two changes in particular make the Team Rocket Hideout easier than previous runs, though they don’t trivialize it. See this doc for more details.

  3. ^

    For more on Pokémon agent harnesses, see this previous LW post. But tl;dr harnesses do a lot of work to make the game understandable to an LLM, and use several techniques to address agentic weaknesses common to all LLMs. Even though the harnesses may seem fairly simple, and can have (and have had!) their tools coded by the LLMs using them, game-winning harnesses have also been relentlessly optimized by human trial and error to provide exactly the support necessary to overcome current LLM limitations.

  4. ^

    With some editing on my part.

  5. ^

    Modern Gemini/GPT models can also handle this now.

  6. ^

    This might be considered a form of inattentional blindness, the classic example of which is the guy in a gorilla suit walking through a basketball game.

  7. ^

    Probably helps that he can’t really remember enough of his experiences to get bored. That may be what we all do in the posthuman future, though on a longer timescale.

  8. ^

    9000 of those spent stuck on that left arrow spinner issue.

  9. ^

    Technically it still took Claude a few hours to notice the CUT-able trees inside the gym that block access to Erika, the gym leader, but he noticed eventually.

  10. ^

    The minimap only fills out as the LLM explores. Good prompting ensures that the LLM explores basically everything as a first priority, which means in practice the LLM always has a good map of the area it can understand. This bypasses a lot of vision and spatial reasoning weaknesses. Other key tools include an LLM-reasoning-powered navigator and the ability to place map markers.
