The Eternal Junior: Why AI Computes but Does Not Think

We love to anthropomorphize our tools. When a Large Language Model produces a brilliant piece of system architecture or a perfectly structured product requirement document, it is tempting to pause and wonder: Is the machine actually thinking?

The philosophical counterpunch is well-rehearsed. Aren’t humans just advanced prediction engines? Aren’t we drawing on our own training data (memories and experiences) to guess what we should do or say next, perhaps just running on a noisier temperature setting?

It is a legitimate question, and anyone building with these tools owes it to themselves to take it seriously rather than dismiss it. But taking it seriously means examining what both humans and machines actually do when they produce intelligent-seeming output, and that examination reveals a gap that no amount of scaling has yet closed.


Why This Philosophical Debate Is Older Than You Think

The question of whether a machine can think did not begin with ChatGPT. It arguably began with Alan Turing’s 1950 paper “Computing Machinery and Intelligence”, which reframed the question from “Can machines think?” to the operational “Can machines do what we (as thinking entities) can do?” — the famous Imitation Game.

Three decades later, John Searle attacked Turing’s framing head-on with the Chinese Room Argument (1980). In his thought experiment, a person in a sealed room follows syntactic rules to manipulate Chinese characters and produces outputs indistinguishable from a native speaker. Searle’s conclusion: syntax alone does not produce semantics. Symbol manipulation, no matter how sophisticated, does not constitute understanding. Current LLMs are, in a very literal sense, the most elaborate Chinese Room ever constructed, manipulating tokens according to learned statistical distributions without any access to what those tokens mean.

The counterargument comes from Daniel Dennett’s intentional stance framework (1987). Dennett argues that if a system’s behavior is reliably predicted by attributing beliefs and desires to it, then the distinction between “real” understanding and “simulated” understanding may be vacuous. For Dennett, there is no privileged inner light of consciousness. There are only patterns of behavior that warrant or fail to warrant intentional descriptions. If you treat the LLM as though it understands your codebase to get useful work out of it, what exactly is the difference between “real” understanding and “simulated” understanding?

The difference, according to David Chalmers, is the hard problem of consciousness (1995). Even if we could explain every functional, computational, and behavioral property of a system, we would still face the question: what is it like to be that system? Functional equivalence does not entail experiential equivalence. An LLM may produce outputs identical to a thinking being without there being anything it is like to be that LLM. Thomas Nagel made the same point two decades earlier in “What Is It Like to Be a Bat?” (1974): We can completely understand the physics, biology, and neurology of how a bat uses echolocation to fly around in the dark. We can map the entire physical computation of that process. But no matter how much objective data we collect, we will never actually know the internal, subjective feeling of being that bat experiencing that sonar. That internal feeling of awareness is the “what it is like.”

The honest answer is that this debate remains open. But practitioners do not need to resolve the hard problem of consciousness to make a practical observation: the way these systems produce outputs is mechanistically different from the way humans produce insights, and that difference matters enormously for how we should use them.


The Eternal Junior

If you have worked in any engineering organization, you recognize the archetype immediately. There is a particular kind of junior contributor whose entire workflow consists of pattern-matching against existing work. They find a similar pull request, replicate its structure, and produce something that is often syntactically correct and occasionally impressive in its coverage. But they never ask why the architecture looks the way it does. They never push back on a requirement. They never say “this whole approach is wrong, here’s what we should do instead.”

A few weeks back on LinkedIn, I argued that a Large Language Model is, in effect, a Junior Developer. Except its pattern library is not a few hundred pull requests — it is the entire public internet, refined through reinforcement learning from human feedback (RLHF) to produce outputs that human evaluators rate as helpful, harmless, and honest. Its recall is superhuman. Its synthesis across domains is genuinely remarkable. But it is, at its core, an optimization process over a static probability distribution. It selects the next token that is most likely to satisfy the learned objective.
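That last sentence can be made concrete with a toy sketch. Everything here is invented for illustration — the vocabulary, the scores, the temperature value; a real model derives its scores from billions of learned parameters, not a hand-written table. But the final step really is this mechanical:

```python
import math

# Toy next-token step. The scores ("logits") are made up for illustration;
# a real model computes them from learned weights.
logits = {"the": 4.2, "a": 3.1, "therefore": 1.5, "penguin": -2.0}

def next_token_distribution(logits, temperature=1.0):
    """Softmax over logits; temperature rescales how 'noisy' sampling is."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    z = sum(math.exp(s) for s in scaled.values())
    return {tok: math.exp(s) / z for tok, s in scaled.items()}

probs = next_token_distribution(logits, temperature=0.7)

# The "eternal junior" move: emit the statistically likeliest continuation.
most_likely = max(probs, key=probs.get)
print(most_likely)
```

Note what is absent: no goal, no belief, no model of the reader. The entire decision is a ranking over a probability distribution.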

The metaphor holds, but my original comment overstated the case by calling the AI “un-coachable.” That is factually wrong, so I will try to redeem myself a little bit here. Fine-tuning, RLHF, retrieval-augmented generation, and system prompts all demonstrably shape model behavior. You can hand this junior another shelf of reference books (RAG), you can train it to prefer certain coding styles over others (fine-tuning), and you can give it real-time access to documentation. These are meaningful interventions. What you cannot do is give it the silver bullet that transforms a junior into a senior: the capacity to form independent judgment about what should be built.
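The “shelf of reference books” intervention is worth a sketch, because it shows both the power and the limit of RAG. This is a deliberately naive toy (production systems use vector embeddings and a real model, not word overlap and string formatting; the document names are invented), but the shape is the same: retrieval changes what the junior can see, not how it judges.

```python
import re

# Hypothetical internal documents, invented for illustration.
docs = {
    "style_guide": "All service handlers must return typed errors.",
    "adr_007": "We chose Postgres over DynamoDB for transactional writes.",
}

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, docs, k=1):
    """Rank documents by naive word overlap with the query (a stand-in
    for embedding similarity in a real RAG pipeline)."""
    q = tokens(query)
    scored = sorted(docs.items(),
                    key=lambda kv: len(q & tokens(kv[1])),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(query, docs):
    """Prepend retrieved context to the user's question."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("Why did we choose Postgres?", docs)
print(prompt)
```

The retrieval step enlarges the pattern library at query time; the generation step downstream is unchanged.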

The critical distinction between a human junior and an AI junior is trajectory. A human junior accumulates not just patterns but what Michael Polanyi called tacit knowledge: the inarticulate understanding that we possess but cannot fully specify in explicit rules. “We can know more than we can tell.” Over time, a human develops what practitioners vaguely call “taste” or “intuition”: the ability to sense that an architecture is fragile before they can articulate why, or to feel that a product direction is wrong before the data confirms it. This tacit dimension of expertise is precisely what Polanyi’s Paradox identifies as the hardest thing to automate.

Hubert Dreyfus made essentially the same argument from a phenomenological perspective, drawing on Heidegger’s concept of being-in-the-world: the idea that human understanding is grounded in practical, embodied engagement with our environment, not in detached symbol manipulation. In What Computers Can’t Do (1972), Dreyfus argued that AI would fail wherever expertise depends on the kind of holistic, situated judgment that cannot be decomposed into explicit rules. The LLM has no body, no workshop, no experience of tools breaking in its hands. It has statistical associations between tokens.


The Serendipity Problem

I often use the discovery of penicillin as proof that breakthroughs come from accidents. It’s a simplified view, yes. The history is more instructive than that. Fleming did not stumble onto penicillin through pure luck. He had been studying antibacterial substances for years and had already discovered lysozyme in 1922. When the contaminated petri dish appeared, he was prepared to see it for what it was. Louis Pasteur’s dictum applies: “Chance favors only the prepared mind.”

I use this example to point out a specific kind of cognitive readiness that LLMs structurally lack. The creative act in Fleming’s case was not the contamination (the accident) but the recognition (the judgment). He saw an anomaly and chose to pursue it rather than discard it. That choice required motivation, domain knowledge, and a willingness to deviate from his planned experiment.

Karl Popper formalized this insight. In Conjectures and Refutations (1963), he argued that scientific progress does not come from accumulating confirming observations (the inductive method) but from bold conjectures that go beyond existing evidence, followed by ruthless attempts at refutation. Discovery requires proposing something the data does not yet support and then trying to prove yourself wrong. LLMs are architecturally incapable of this. They do not conjecture; they interpolate. They do not seek falsification; they seek the highest-probability completion.

Thomas Kuhn’s The Structure of Scientific Revolutions (1962) tells the same story at a larger scale. Science does not progress through steady accumulation within a fixed framework. It advances through paradigm shifts triggered by anomalies that the existing paradigm cannot absorb. The researchers who drive revolutions are precisely those who refuse to ignore the anomalies. An LLM, trained on the existing paradigm’s literature, will reproduce that paradigm’s assumptions. It is structurally a normal-science engine, not a revolutionary-science engine.

The Dirac story is a perfect illustration. Paul Dirac did not discover antimatter by following the consensus. His 1928 equation produced solutions with negative energy, a result that violated classical physics. Rather than discarding them as mathematical artifacts (the statistically probable move), he took them seriously and predicted the existence of the positron, which Carl Anderson confirmed experimentally in 1932. This was not pattern-matching. It was a theoretical commitment to a mathematically coherent but empirically unprecedented conclusion.


What These Tools Actually Are

None of this makes LLMs useless. It makes them something specific: extraordinarily powerful interpolation engines with superhuman recall.

AlphaFold mapping the structures of over 200 million proteins is a genuine scientific achievement. But AlphaFold did not decide to solve protein folding. Demis Hassabis and John Jumper applied decades of institutional knowledge to a well-defined computational problem, predicting 3D structure from amino acid sequence, and built a specialized architecture to solve it. The human contribution was the problem formulation, the judgment that this problem was solvable with this approach, and the decision to allocate resources to it.

The “Sparks of AGI” paper from Microsoft Research (2023) documented that GPT-4 exhibits surprising capabilities across mathematics, coding, medicine, and law, sometimes approaching human-level performance. This is real, and it should be taken seriously. But the same paper explicitly noted that advancing further may require “a new paradigm that moves beyond next-word prediction.” The researchers who were most impressed by the capabilities were also the ones most clearly articulating the architectural ceiling.

Jason Wei documented over 100 emergent abilities in large language models, capabilities that appear suddenly at scale and were not present in smaller models. This is genuinely surprising and challenges naive dismissals of LLMs as “just” autocomplete. But emergent capabilities within a fixed architecture are not the same as open-ended cognitive development. A calculator with enough digits can produce surprising numerical patterns, but it has not learned mathematics.

Andy Clark and David Chalmers’ extended mind thesis (1998) offers perhaps the most productive framing. They argue that cognition does not stop at the skull. Tools, notebooks, and computational devices can become genuine parts of our cognitive processes when they are reliably coupled to our decision-making. Under this view, the LLM is not a thinker and not merely a tool. It is a cognitive prosthetic, an extension of human thought that amplifies certain capabilities (recall, synthesis, translation between domains) while contributing nothing to others (judgment, motivation, values, commitment).

This framing avoids both the hype (“it’s thinking!”) and the dismissal (“it’s just autocomplete”). The LLM becomes part of a thinking system only when a thinking human is in the loop.


The Practitioner’s Framework

Wittgenstein argued in Philosophical Investigations (1953) that the meaning of a word is its use in a language-game. The meaning is not some private mental content but a function of how language is deployed in practice. There is an irony here: the LLM is perhaps the most sophisticated instantiation of Wittgenstein’s use-theory ever built. It has learned the use of language with extraordinary fidelity. What it lacks is what Wittgenstein called a “form of life”: the lived context of needs, intentions, social relations, and embodied experience within which language-games are played.

Bender, Gebru, McMillan-Major, and Shmitchell coined the term “stochastic parrots” (2021) to capture exactly this gap: LLMs produce language without understanding the communicative intent behind it. The metaphor is imperfect (parrots do not generalize across domains the way LLMs do), but the core point holds. Fluent output is not the same as understood output.

Kahneman’s dual-process theory provides a useful analogy for practitioners. System 1 (fast, automatic, pattern-matching) and System 2 (slow, deliberate, effortful reasoning) work together in human cognition. LLMs are something like an externalized System 1 with a vast pattern library: fast, fluent, and confidently wrong in exactly the ways that System 1 is confidently wrong. The human in the loop provides the System 2 oversight: checking the reasoning, questioning the assumptions, catching the hallucinations.

Chomsky, Roberts, and Watumull argued in their 2023 New York Times essay “The False Promise of ChatGPT” that LLMs achieve superficial linguistic competence without the generative grammar that underlies human language capacity. To translate this dense linguistic claim: the problem is that the LLM’s mimicry is so incredibly good that it fools us. It has achieved “superficial linguistic competence.” It sounds so human that we project human intelligence onto it. But the moment you push an LLM into an edge case that wasn’t well-represented in its training data, the illusion shatters. It starts hallucinating because it never actually understood the rules of the game. It only memorized the previous moves.

So here is the practical framework:

  1. Treat the LLM as a cognitive prosthetic, not a cognitive agent. It extends your recall and synthesis. It does not replace your judgment.
  2. Give it the tightest possible context. The quality of the output is a direct function of the quality of the input. Garbage prompt, garbage architecture.
  3. Verify everything. Not because the tool is bad, but because interpolation engines are systematically overconfident on out-of-distribution inputs, and you cannot always tell when you have crossed that boundary.
  4. Reserve the creative leaps for yourself. Problem formulation, architectural bets, and the decision to pursue an anomaly rather than discard it: these remain human contributions.
  5. Do not confuse fluency with understanding. The most dangerous failure mode is not the obvious hallucination. It is the confidently wrong answer that sounds exactly like a right one.
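Point 3 deserves a demonstration, because the failure it describes is silent. Here is a toy interpolation engine (an ordinary least-squares line, not an LLM, but the same structural point): trained on a narrow slice of data, it answers queries outside that slice with exactly the same fluency as queries inside it, and attaches no warning that the second answer is unsupported.

```python
import math

# Training data: y = sin(x), sampled ONLY on the narrow slice [0, pi/4],
# where the curve is nearly linear.
xs = [i * (math.pi / 4) / 50 for i in range(51)]
ys = [math.sin(x) for x in xs]

# Closed-form least-squares fit of the line y = a*x + b.
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def predict(x):
    """Answers every query with identical confidence, in or out of range."""
    return a * x + b

# In-distribution query: the model is nearly exact.
err_in = abs(predict(math.pi / 8) - math.sin(math.pi / 8))

# Out-of-distribution query: confidently, silently wrong.
err_out = abs(predict(3 * math.pi) - math.sin(3 * math.pi))

print(f"in-distribution error:     {err_in:.4f}")
print(f"out-of-distribution error: {err_out:.4f}")
```

The model itself gives you no signal that the boundary was crossed; only outside verification catches it. That is the whole case for point 3.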

Conclusion

The question “Is the AI thinking?” is the wrong question for practitioners. The right question is: “What kind of cognitive work is this tool doing, and what kind must I still do myself?”

The answer, supported by six decades of philosophy of mind and three years of intensive empirical work on LLMs, is clear. The machine computes. It interpolates. It synthesizes patterns across a corpus of human knowledge that no individual could hold in memory. These are genuine, valuable capabilities.

But it does not conjecture. It does not commit to a theory in the face of skepticism. It does not feel the pull of an anomaly or the discomfort of a paradigm under strain. It does not bring what Popper called “bold conjectures” or what Kuhn called “revolutionary science” or what Polanyi called “personal knowledge.” It does not have a form of life.

You bring the variance. The machine brings the volume. Confusing the two is the most expensive mistake you can make.
