Why ChatGPT Feels Like Magic While Siri Feels Dumb.

Author(s): Suchitra Malimbada

Originally published on Towards AI.

Understanding the fundamental architectural shift from sequential processing to parallel attention, and why it enabled GPT-5’s capabilities that were impossible for LSTMs

(Image created by author)

GPT-5 scored 100% on the 2025 American Invitational Mathematics Examination. It handles 400,000-token contexts without breaking a sweat. It learns new tasks from examples in the prompt without weight updates. Ten years ago, the state-of-the-art LSTM struggled with dependencies spanning more than 13 tokens.

The gap isn’t just about scale. Throwing 1.8 trillion parameters at an LSTM architecture wouldn’t give us GPT-5’s capabilities. The difference is architectural, and it runs deeper than most explanations suggest. This isn’t about transformers being “better” at the same task. Transformers enable algorithms that have no LSTM equivalent.

Consider what happens when GPT-5 processes the sentence “The trophy would not fit in the suitcase because it was too large.” To resolve what “it” refers to, the model performs a single parallel computation across all tokens. The connection from “it” to “trophy” happens in constant time, regardless of distance. An LSTM compresses information about “trophy” and “suitcase” into a fixed-size hidden state before seeing “it,” passing through eight intermediate states that mix and compress at each step. The architecture fundamentally cannot perform the same computation.

This architectural distinction cascades into everything modern LLMs do. In-context learning requires comparing tokens directly to recognize structure. Chain-of-thought reasoning requires attending back to previously generated steps. Both capabilities emerge from transformer architectures and remain impossible for RNNs, regardless of parameter count.

Table of Contents

- The Mathematics of Information Flow
- Why Sequential Processing Creates a Ceiling
- Context Windows: 13 Tokens vs 400,000 Tokens
- Emergent Capabilities That Need Attention
- Scale Enablement and Parallelization
- Real-World Evolution: Siri’s Transformation
- Architecture Determines Capability Ceilings

The Mathematics of Information Flow

Self-Attention: Parallel Information Access

The transformer’s self-attention mechanism computes relationships between all positions simultaneously:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Every query vector computes dot products with every key vector in a single matrix multiplication, producing an n×n attention matrix in constant sequential operations. When GPT-5 processes a 400,000-token document, position 399,999 can attend directly to position 0. The path length between any two tokens is always one.

For a sequence of length n and dimensionality d, self-attention requires O(n²·d) operations. The quadratic scaling makes naive attention expensive, but O(1) sequential operations mean every computation runs in parallel across thousands of GPU cores. GPT-5 trains on clusters with tens of thousands of accelerators, computing attention for billions of tokens simultaneously.

LSTM: Sequential Compression

The LSTM processes sequences through recurrent updates to a hidden state:

f_t = σ(W_f [h_{t-1}, x_t] + b_f)     # forget gate
i_t = σ(W_i [h_{t-1}, x_t] + b_i)     # input gate
o_t = σ(W_o [h_{t-1}, x_t] + b_o)     # output gate
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)  # candidate cell state
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t       # cell state update
h_t = o_t ⊙ tanh(C_t)                 # hidden state

At each timestep, the network updates its hidden state based on the previous state and current input.
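To see the structural difference side by side, here is a minimal NumPy sketch of the two computations above: scaled dot-product attention, where one matrix multiplication relates every position to every other position, and an LSTM-style recurrence, where each step must wait for the previous hidden state. The shapes, dimensions, and random weights are illustrative assumptions, not anything from GPT-5 or a production LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 32                                  # toy sequence length and width (illustrative)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    One matrix multiplication relates every position to every other position,
    so the path between any two tokens has length 1."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # n x n attention matrix, all pairs at once
    return softmax(scores) @ V                 # each output row mixes all n positions

def lstm_forward(X, W, b):
    """The LSTM recurrence from the equations above. Step t cannot start until
    step t-1 has produced h and C, so n tokens cost n sequential steps."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = np.zeros(d)
    C = np.zeros(d)
    for x_t in X:                              # inherently sequential loop over timesteps
        z = np.concatenate([h, x_t])
        f = sigma(W["f"] @ z + b["f"])         # forget gate
        i = sigma(W["i"] @ z + b["i"])         # input gate
        o = sigma(W["o"] @ z + b["o"])         # output gate
        C_tilde = np.tanh(W["c"] @ z + b["c"]) # candidate cell state
        C = f * C + i * C_tilde                # cell state update
        h = o * np.tanh(C)                     # hidden state: a fixed-size, lossy summary
    return h

X = rng.normal(size=(n, d))                    # toy token embeddings
W_q, W_k, W_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
attn_out = self_attention(X, W_q, W_k, W_v)    # shape (n, d): one vector per position

W = {g: 0.1 * rng.normal(size=(d, 2 * d)) for g in "fioc"}
b = {g: np.zeros(d) for g in "fioc"}
h_final = lstm_forward(X, W, b)                # shape (d,): all n tokens squeezed into one vector
```

Note how the attention output keeps one vector per position, while the recurrence ends with a single fixed-size hidden state that must stand in for the entire sequence.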
Information from early tokens must pass through every intermediate state to reach later positions. For a sequence of length n, the path length is O(n). Processing 400,000 tokens requires 400,000 operations that must execute one after another. Modern GPUs have 10,000+ cores sitting idle while the LSTM processes one timestep at a time.

The Path Length Problem

Path length determines what dependencies a network can learn. Gradients during backpropagation must travel the same paths as forward information flow. In an LSTM processing a 100-token sequence, gradients pass through 100 hidden state updates to reach the first token. Each passage multiplies the gradient by a factor less than one. After 20–30 steps, gradients decay by a factor of 10⁷.

The LSTM’s gates mitigate vanishing gradients to a point, enabling dependencies across 20–50 tokens where vanilla RNNs fail at 5–10. But the fundamental issue remains: information must traverse n states, and gradients must travel back along that same path.

Transformers sidestep this entirely. Every position connects to every other position with path length one. Gradients flow directly from outputs to inputs without compression, making long-range dependency learning tractable regardless of sequence length.

Why Sequential Processing Creates a Ceiling

The Fixed-Size Hidden State

An LSTM with hidden dimension 1024 compresses all previous context into 1024 floating-point numbers. When processing position 1000, the hidden state contains a lossy summary of the previous 999 tokens.

UC Berkeley research quantified this bottleneck precisely. They trained LSTM language models with different n-gram orders, where the model could only look back n tokens. An LSTM with arbitrary context length performed identically to an LSTM with n=13. Beyond 13 tokens, additional context provided no benefit. The hidden state saturated.

This isn’t a training failure. It’s architectural. Modern transformers use attention over 400,000 tokens because tasks require it. Document summarization, codebase understanding, and long-form reasoning need to reference information thousands of tokens in the past.

Information Loss and Parallel Computation

Each LSTM timestep applies a learned compression function combining the previous hidden state with current input. These functions are optimized during training but remain lossy. Information about which specific words appeared and in what order degrades with each state transition.

Transformers never compress context. The full sequence of embeddings remains accessible at every layer. When GPT-5 generates token 50,000, it can attend back to token 1 with the same fidelity as token 49,999.

Consider translating a technical document with acronyms. “The Convolutional Neural Network (CNN) architecture…” appears at token 100. At token 5,000, the text references “the CNN.” A transformer attends directly back to token 100. An LSTM’s hidden state has compressed away the specifics, so the translation will likely fail.

The sequential bottleneck prevents LSTMs from leveraging parallel hardware. Training GPT-5’s 1.8 trillion parameters required processing trillions of tokens across thousands of GPUs, each computing attention for different positions simultaneously. An LSTM processing 400,000 tokens runs 400,000 sequential steps. No amount of GPUs can parallelize this dependency chain. Training an LSTM with 1.8 […]
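As a rough back-of-the-envelope check on the gradient decay described in The Path Length Problem above, the snippet below multiplies an assumed per-step shrink factor over recurrent paths of different lengths and compares them with the transformer’s length-one path. The per-step factors are illustrative assumptions, not measurements from any model.

```python
# Back-of-the-envelope arithmetic for gradient decay along a recurrent path.
# The per-step factors are assumed, illustrative values, not measured from any model.

def surviving_gradient(per_step_factor: float, path_length: int) -> float:
    """Fraction of the gradient signal left after backpropagating through
    `path_length` multiplicative steps."""
    return per_step_factor ** path_length

for factor in (0.9, 0.7, 0.5):
    for steps in (25, 100):
        print(f"per-step factor {factor}, path length {steps:>3} "
              f"-> {surviving_gradient(factor, steps):.1e} of the signal remains")

# With a per-step factor of 0.5, a 20-30 step path shrinks the signal by roughly
# 10^6 to 10^9, the same regime as the ~10^7-fold decay described above.
# A transformer's attention path from any output to any input has length 1:
print(f"direct path of length 1, factor 0.5 -> {surviving_gradient(0.5, 1):.1e}")
```

The point of the arithmetic is that the decay is exponential in path length, so shortening every path to one step, as attention does, removes the chain of shrinking factors entirely.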
