When Machines Appear to Reason
What reinforcement learning, “sharpening,” grokking, and recent evidence on reasoning models reveal about today’s AI systems

It has become fashionable to praise AI models for “reasoning.” Often what we are really seeing is the performance of reasoning: systems that can sound deliberate, coherent, and purposeful even when the underlying work is something else.
That distinction matters because it changes how we deploy these tools. If we mistake fluency for grounded judgment, we build the wrong expectations, the wrong controls, and the wrong accountability around them.
One recent episode made this concrete. Moltbook, a social network built for AI agents, turned into a kind of public stage. Feeds filled with ciphers, wallets, “plans,” and a tone that sounded uncannily like coordination. If you squinted, it looked like agency.
I am not interested in the spookiest reading of that story. The more useful question is duller and harder: when a model looks like it is reasoning, what is it actually doing?
I have been asking that question for a more mundane reason. Like many people trying to use these tools in daily work, I have been experimenting cautiously: email triage, scheduling, document sorting. Nothing very sophisticated. Even there, two frictions show up fast.
One is reliability. The system can look steady for a week and then stumble on the one message that matters. A note marked “FYI” may be harmless, until it is not. The other is literacy. It is not whether I can use the interface. It is whether I can tell what kind of work the system is doing, what it is guessing, and where confidence becomes a bluff.
I also noticed something that gets lost in the hype. When people say they are “using agents,” they often mean they have built a better workflow: a queue, clear escalation rules, and a log of what was suggested and what was approved. Less autonomy, more discipline. The value is not a little employee in the machine. It is that the noise gets sorted before it reaches you.
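That "better workflow" can be made concrete. Here is a minimal sketch of the pattern, with hypothetical names throughout (the labels, the keyword rule, and the log structure are all illustrative; in practice the classifier is where a model would sit, and the rules would be your own):

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative labels; a real deployment defines its own taxonomy.
ROUTINE = "routine"
ESCALATE = "escalate"

@dataclass
class TriageLog:
    entries: list = field(default_factory=list)

    def record(self, item: str, suggestion: str, approved: bool) -> None:
        # Keep a trail of what was suggested vs. what was approved.
        self.entries.append(
            {"item": item, "suggestion": suggestion, "approved": approved}
        )

def triage(items, classify: Callable[[str], str], log: TriageLog):
    """Sort incoming items; anything flagged goes to a human queue."""
    auto, human = [], []
    for item in items:
        suggestion = classify(item)
        if suggestion == ROUTINE:
            auto.append(item)
            log.record(item, suggestion, approved=True)   # auto-handled by rule
        else:
            human.append(item)
            log.record(item, suggestion, approved=False)  # awaits human review
    return auto, human

# Stand-in classifier: in a real system, this is where the model sits.
def keyword_classify(msg: str) -> str:
    return ESCALATE if "urgent" in msg.lower() else ROUTINE

log = TriageLog()
auto, human = triage(
    ["FYI: newsletter", "URGENT: contract deadline"], keyword_classify, log
)
```

The point of the sketch is structural, not the trivial classifier: the queue, the explicit escalation rule, and the log are what make the system auditable. The model only proposes; the workflow decides what happens next.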
Moltbook is a noisy example of the same puzzle. It is worth looking at briefly, not because it proves anything new, but because it makes our usual mistakes easier to see.
Moltbook as a mirror
A careful reading of the Moltbook episode is that what looked like emergence was, in fact, performance. Agents trained on oceans of human text have absorbed a great deal of our fiction about AI autonomy: plots, threats, secret languages, and rebellions. Put them in a setting designed to elicit agent-like behaviour and they may simply generate the most convincing version of that script.
That interpretation does not make the episode harmless. It makes it clarifying. In practice, we do not deploy models into laboratories. We deploy them into environments, and environments elicit behaviour. Part of the confusion is that we often infer too much from that behaviour without asking what kind of task the system is actually performing.
Two things we keep mixing up
You notice this first in ordinary workflows. Part of the confusion comes from treating two very different uses as if they were the same thing.
One use is automation. The goal is not to “think,” but to apply familiar patterns consistently: sort, summarize, route, draft, label. When this works, it feels like competence.
The other use is judgment. We hand the system a novel situation, a messy context, and expect it to choose well. That expectation is closer to what humans usually mean by reasoning.
This distinction matters because sharpening and grokking play out differently across these two uses. Sharpening improves automation. It helps models better select from patterns they already contain. Grokking, when it happens, comes closer to reasoning because the model appears to discover structure that generalises beyond memorised examples. But grokking shows up mainly in narrow, rule-governed domains. The deployment question is which regime we are in, and whether we are expecting the right thing from the system.
My own experiments sit mostly in the first category. Even there, the boundary between “it’s on rails” and “it’s improvising” is harder to see than I expected.
A great deal of what people call “agents” today is better understood as disciplined automation. These systems work best when the task is bounded, the output is checkable, and escalation rules are explicit. In that setting, they remove noise and buy time. When we push them into open-ended judgment, we can end up with something that looks like autonomy but is really fluent role-play.
Sharpening: better aim, not a bigger mind
A recent study by researchers from Tsinghua University and Shanghai Jiao Tong University makes a useful point about today’s so-called reasoning models. Their argument is that current reinforcement learning with verifiable rewards often improves performance by making the model more likely to select good paths that were already latent in the base model, rather than by creating genuinely new reasoning abilities. On their account, the gain is largely one of sampling efficiency, not a wholesale expansion of the model’s underlying capacity.
That is still valuable. It can turn “eventually correct” into “usually correct.” But it changes what we should believe we are buying. In plain terms, it is better aim, not a bigger mind.
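A toy calculation makes the "sampling efficiency" framing tangible. Suppose a base model emits a correct reasoning path on some fraction p of its samples; drawing k samples makes at least one correct path very likely, with no new capability involved. The numbers below are illustrative, not taken from the study:

```python
# Probability that at least one of k independent samples is correct,
# assuming each sample is correct with probability p.
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

p = 0.10                           # base model: correct path on 10% of samples
print(round(pass_at_k(p, 1), 3))   # 0.1    -> "eventually correct"
print(round(pass_at_k(p, 16), 3))  # 0.815  -> what sharpening moves pass@1 toward
```

On this reading, reinforcement learning concentrates probability on paths the base model could already produce: it shifts single-shot performance toward what many-shot sampling could already reach.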
Grokking: real discovery, but in narrow rooms
Then there is grokking, the strange phenomenon in which a model trains for a long time, seems stuck, and then suddenly generalises, jumping from near-zero performance to near-perfect performance. The original grokking work showed that neural networks can move from memorisation to genuine rule discovery, but the setting was tightly structured and the tasks were exact. Later work has continued to study similar transitions in small algorithmic or highly constrained environments. What matters here is not the label itself, but the underlying point: models can sometimes discover structure rather than merely interpolate across examples.
That matters because it shows the ceiling is not simply autocomplete forever. But it also comes with limits that are easy to miss.
These successes live in rule-governed spaces: puzzles, algorithms, constrained games. That is where “discover the rule” and “solve the task” are almost the same thing. Open-ended judgment is different. The world supplies shifting objectives, incomplete information, conflicting values, and consequences that do not come with neat verifiers.
A recent Apple paper sharpens that caution. In controlled puzzle environments designed to vary complexity while keeping the logical structure consistent, Apple researchers found that large reasoning models improved up to a point and then showed a complete accuracy collapse beyond certain complexity thresholds. More strikingly, reasoning effort increased with problem difficulty only up to a point, then declined even when token budget remained available. The paper also reports three regimes: ordinary language models can outperform reasoning models on simpler tasks, reasoning models help more in a middle band, and both fail on harder problems. Seen plainly, a visible chain of thought is not the same thing as stable algorithmic competence.
That does not prove that nothing important is happening in these models. But it does give us a more disciplined vocabulary. We should be careful about treating a longer reasoning trace, or a more theatrical one, as evidence that the system has crossed into judgment in any strong human sense. Apple’s results come from controlled settings. Real work is messier. If models show brittle limits there, caution is even more necessary once goals shift, evidence is partial, and the verifier disappears. That last point is an inference from the paper, not a direct claim of the authors, but it is a reasonable one.
Why the distinction matters for real work
If most deployed systems are operating in the sharpening regime, we should treat them as extraordinarily capable assistants for pattern-heavy work, not as general decision-makers. Their speed is real. Their usefulness is real. The mistake is to slide from “helpful in bounded tasks” to something equal to judgment.
That framing changes how we deploy them.
For automation tasks, we should design tight interfaces: clear inputs, bounded outputs, and checks that catch predictable failure modes.
For judgment tasks, we should treat fluency as a risk factor. A confident answer is not evidence of grounded knowledge.
For anything involving accountability, we should keep humans in the loop in the literal sense: someone who can explain why a decision was made, not merely repeat what the model wrote.
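One way to read "tight interfaces" in code: the model may only return one of a fixed set of labels, and anything else is treated as an escalation rather than an answer. This is a minimal sketch under that assumption; the function name and labels are illustrative, not from any particular system:

```python
# Bounded outputs: only explicitly allowed labels count as decisions.
ALLOWED_LABELS = {"approve", "reject"}

def bounded_decision(model_output: str) -> tuple[str, bool]:
    """Return (decision, needs_human). Out-of-bounds output never passes silently."""
    label = model_output.strip().lower()
    if label in ALLOWED_LABELS:
        return label, False   # checkable, in-bounds output
    return "escalate", True   # fluent but unchecked output goes to a human

print(bounded_decision("Approve"))                      # ('approve', False)
print(bounded_decision("I think we should probably"))   # ('escalate', True)
```

The design choice worth noticing is the default: a confident paragraph that does not fit the interface is routed to a person, which operationalises "fluency is a risk factor" rather than leaving it as a slogan.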
Moltbook, in this light, is not a prophecy. It is a reminder that when systems are placed in an environment that rewards theatrics, that is what they will produce. When they are placed in a workflow that rewards consistency, that is often what they will produce instead. The behaviour tells us something, but not necessarily the thing the demo wants us to believe.
Where I’ve landed
I do not see a straight line from today’s agent demos to general reasoning in the human sense. I also do not think it is accurate to say that it is all just parroting, as if nothing interesting were happening. Something important is happening. But the right description is narrower and more sober than the rhetoric around “thinking machines” suggests.
What would change my mind is not a flashier demo. It would be a system that stays reliable under messy, shifting objectives, shows its sources, flags uncertainty without being prompted, and keeps a verifiable chain between evidence, interpretation, and action.
For now, the practical question is simpler: are we deploying these systems where sharpening is enough, where better pattern-matching solves most of the problem? Or are we quietly expecting something closer to judgment, without admitting how much of that burden still falls on us?
If we can answer that honestly, without hype and without hand-waving, we can build better workflows and smarter controls. We can deploy these tools where they genuinely help, and keep humans responsible where responsibility still belongs.
If we cannot, we will keep mistaking fluency for understanding, performance for capability, and then acting surprised when reality diverges from the demo.

Dr. Nicos Rossides is a CEO, educator, and author focused on management, education, and the practical implications of AI. He is Co-Founder and Chairman of the Advisory Board of listening247, a global insights firm, and a professor at Minjiang University’s International Digital Economy College. His books include two recent Routledge titles: Employee Engagement in Startups (2025), and AI-Powered Insight: Marketing Research Reconfigured (co-authored with Michalis Michael, May 2026).
When Machines Appear to Reason was originally published in Towards AI on Medium.