The Architecture Mismatch at the Heart of Modern AI

Author(s): Marc Bara

Originally published on Towards AI.

Photo by Google DeepMind on Unsplash

We have exactly one example of general intelligence: the human brain. We are spending hundreds of billions trying to build another with AI. And we are not copying the one that works. Why? The answer has less to do with science than with which hardware happened to be available.

This is not obvious at first glance. Over the last few years, research comparing brains and large language models has revealed genuine convergence: both rely on prediction, hierarchical representations, error correction, and compression. But they diverge sharply in how those principles are implemented. The human brain processes information sequentially, causally, and under tight energy constraints, while modern AI systems rely on massive parallelism, frozen weights during inference, and hardware optimized for matrix multiplication at scale, the exact operation GPUs were built to accelerate.

This gap is not an accident, and it is not primarily scientific. It is the result of architectural choices shaped by hardware, tooling, and capital investment. Changing what models predict is comparatively easy. Changing how they compute, and what physical substrate supports that computation, is much harder. Understanding this mismatch is essential if we want to understand both the limits of today’s AI systems and the kinds of intelligence they are structurally capable of producing.

The science: where brains and LLMs converge and diverge

Research comparing LLMs to brains has produced findings that are both encouraging and sobering. A 2024 study in Nature Machine Intelligence found that as LLMs advance, they become more brain-like: models with better performance show greater correspondence with neural activity patterns, with alignment peaking in intermediate layers. This suggests convergent computational principles: prediction as a core operation, hierarchical representations, statistical learning, error-driven updates.

Photo by Milad Fakurian on Unsplash

But the architectural differences matter. Unlike transformers, which process hundreds or thousands of words simultaneously, the brain’s language areas analyze input serially, word by word, recurrently and temporally. Human attention is guided by goals, emotions, and novelty; it fluctuates and is limited. LLM attention is purely algorithmic. The brain uses approximately 20 watts to run 86 billion neurons; LLMs require megawatts. And a NeurIPS 2024 paper found that much of the neural encoding performance attributed to LLMs is driven by simple features like sentence length and position, urging caution before drawing strong conclusions about cognitive similarity.

This is where most discussions stop. The next question is harder: if brains and LLMs use similar principles, why are they built so differently?

At this point, a common objection appears: intelligence may be achievable through different physical substrates. Biological brains evolved under constraints radically different from silicon, including energy budgets, material availability, and the need for continuous online learning. Perhaps transformers represent a valid alternative path to intelligence, one that trades biological elegance for brute parallel computation. This argument has merit. But it does not address efficiency, scalability to embodied agents, or the specific cognitive capacities that current architectures demonstrably lack.

The question is not whether transformers can be intelligent in some sense, but whether they can be intelligent in the ways that matter for the problems we want to solve. That distinction matters once we move from abstract intelligence to systems that must act, learn, and adapt in the world.
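To make that serial-versus-parallel divergence concrete, here is a minimal sketch in PyTorch. The module choices (a GRU cell standing in for recurrent, word-by-word processing, and a multi-head attention layer standing in for a transformer block) and all dimensions are illustrative assumptions, not a description of any particular model.

```python
# Illustrative contrast, not a real model: sequential recurrence vs. parallel attention.
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 16, 64
x = torch.randn(batch, seq_len, d_model)  # stand-in for a batch of embedded tokens

# Sequential, stateful processing: the hidden state is updated token by token,
# so step t cannot be computed before step t-1 has finished.
rnn_cell = nn.GRUCell(d_model, d_model)
h = torch.zeros(batch, d_model)
for t in range(seq_len):
    h = rnn_cell(x[:, t, :], h)  # strictly ordered, like word-by-word reading

# Parallel processing: a single attention call touches all positions at once,
# reducing to the batched matrix multiplications GPUs were built to accelerate.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
out, _ = attn(x, x, x)  # every token attends to every other token in one shot

print(h.shape, out.shape)  # torch.Size([2, 64]) torch.Size([2, 16, 64])
```

The loop on the recurrent side is the whole story: each step depends on the previous one, so the work cannot be spread across a GPU's parallel units, whereas the attention call processes the entire sequence in one pass. The next section argues that this difference, not any resemblance to biology, is what decided the outcome.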
Why transformers won: hardware, not biology

Transformers did not triumph because they resemble brains. They triumphed because they fit the hardware. The key advantage is parallelization: transformers have no recurrent units, so they can process entire sequences simultaneously during training. As the Mamba paper notes, RNNs and LSTMs were more brain-like in important ways: sequential processing, state maintenance, temporal integration. But recurrent computation does not exploit modern GPUs, which were designed for parallel matrix operations. Training recurrent models was slow; training transformers was fast.

Photo by Taylor Vick on Unsplash

The brain does not need transformers because it operates under different constraints. It runs in real time, causally integrated with the world. It has a 20-watt power budget. It performs many integrated tasks simultaneously: perception, action, homeostasis, emotion. It learns continuously during operation, not in separate training and inference phases. Transformers work despite being architecturally different from brains, compensating with scale: more parameters, more data, more compute.

LeCun’s critique: correct about objectives, silent about architecture

This distinction (between what models predict and how they compute) becomes clearer when applied to the most prominent critique of LLMs. Yann LeCun’s scientific claim is straightforward: models whose core task is predicting the next token cannot achieve true understanding, reasoning, or human-like intelligence, regardless of scale. He has called autoregressive LLMs insufficient for human-level intelligence, or even cat-level intelligence. This claim is defensible. Token prediction is a weak training signal for world modeling, planning, and causal reasoning.

His proposed solution is JEPA: the Joint Embedding Predictive Architecture. Instead of predicting tokens, JEPA predicts continuous embeddings (numerical vectors representing meaning) in a shared semantic space. Instead of reconstructing raw inputs, it predicts abstract representations. This is a meaningful change at the objective layer. JEPA learns to predict states of the world rather than words about the world.

But here is the architectural continuity that goes largely unnoticed. LeCun himself clarified: “JEPA is not an alternative to transformers. In fact, many JEPA systems use transformer modules. It is an alternative to Auto-Regressive Generative Architectures, regardless of whether they use transformers.” The technical details confirm this. I-JEPA consists of three Vision Transformers: a context encoder, a predictor, and a target encoder. V-JEPA uses the same backbone. At the architecture layer, JEPA is still transformers with parallel attention, backpropagation, and GPU-optimized matrix operations.

A Slashdot commenter captured this precisely: “The block diagram for his JEPA solution is the same thing just predicting next floating latent space token instead of discrete word-token. Which is very powerful and cool but I mean it’s not like he is getting rid of backprop or convolution or even attention really.”

LeCun changes what the model predicts. He does not change how the model computes. If the architectural mismatch between transformers and brains matters, JEPA […]
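To see that continuity concretely, here is a minimal, hypothetical sketch of a JEPA-style training step built from off-the-shelf PyTorch transformer layers. The single-layer encoders, the crude zero-masking, the frozen target encoder, and every dimension are simplifying assumptions for illustration (the published I-JEPA and V-JEPA recipes differ, for instance by updating the target encoder as a moving average); the point is only that the objective lives in embedding space while the computation remains parallel attention trained with backpropagation.

```python
# A JEPA-style step in miniature: predict continuous embeddings, not tokens.
# All three modules are ordinary transformer layers; only the objective changes.
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, d_model, n_heads = 2, 32, 64, 4

def make_block() -> nn.TransformerEncoderLayer:
    return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

context_encoder = make_block()  # encodes the visible (context) patches
predictor = make_block()        # predicts representations of the hidden region
target_encoder = make_block()   # produces prediction targets (frozen here for brevity)

x = torch.randn(batch, seq_len, d_model)            # stand-in for patch embeddings
mask = torch.zeros(batch, seq_len, dtype=torch.bool)
mask[:, seq_len // 2:] = True                       # hide the second half of the input

context = x.masked_fill(mask.unsqueeze(-1), 0.0)    # crude masking, purely illustrative
pred = predictor(context_encoder(context))          # predict in embedding space
with torch.no_grad():                               # targets carry no gradient
    target = target_encoder(x)

# The loss compares continuous embeddings of the hidden region, not discrete tokens.
loss = F.mse_loss(pred[mask], target[mask])
loss.backward()                                     # still plain backpropagation
print(float(loss))
```

Swapping the prediction target from discrete tokens to continuous embeddings changes nothing about the substrate: the forward pass is still dense matrix multiplication on a GPU, and learning still happens in a separate training phase through gradient descent, with weights frozen at inference time.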
