Hierarchical Reasoning Models: When 27M Parameters Outperform Chain-of-Thought

Author(s): Kyouma45. Originally published on Towards AI. Paper-explained Series 4.

TL;DR

Most AI models "reason" by talking themselves through problems with chain-of-thought, which is slow, brittle, and expensive. This article explains a different idea: the Hierarchical Reasoning Model (HRM). Instead of reasoning in words, HRM thinks silently in layers, similar to how the human brain separates planning from execution. A slow, high-level module decides what to do, while a fast, low-level module works out how to do it, and the two repeat this cycle until the problem is solved. Despite being much smaller and trained on very little data, HRM solves hard problems (complex Sudoku, mazes, abstract puzzles) that even large language models fail at. The big takeaway: better reasoning does not necessarily come from bigger models or longer explanations; sometimes it comes from better internal structure. In this article, we cover HRM from A to Z.

Link to the original paper: https://arxiv.org/pdf/2506.21734

Core Philosophy and Architecture

HRM doesn't reason by talking to itself; it reasons by thinking longer.

The Hierarchical Reasoning Model (HRM) is designed to overcome a core limitation of modern Transformers: fixed computational depth. No matter how long the input or how hard the task, a standard Transformer always performs the same number of layer-wise computations. This fundamentally limits its ability to carry out long-horizon reasoning, search, and backtracking.

HRM takes a different approach. Instead of increasing depth through more layers or longer chains of thought, it decouples computation time from architectural depth. This allows the model to think longer internally, inside its hidden states, before producing an output. This idea is known as latent reasoning: reasoning happens in continuous internal representations rather than in explicit text tokens.

What HRM Actually Is (Architecturally)

Despite being conceptually different, HRM is still built from Transformer blocks:

- Both reasoning components are encoder-only Transformers.
- They use full self-attention (not linear or causal attention).
- Modern enhancements are included: rotary positional encoding, RMSNorm, and gated feed-forward layers.
- HRM is trained in a sequence-to-sequence setup, just like standard models.
- It is not autoregressive in the usual sense: reasoning does not proceed token by token, but through recurrent state updates.

So HRM does not replace Transformers; it reuses them inside a recurrent, hierarchical loop.

The Two-Module Hierarchical Structure

HRM splits reasoning across two tightly coupled recurrent modules that operate at different timescales, mirroring how the brain separates high-level planning from low-level execution. (A minimal code sketch of this two-module structure follows at the end of this section.)

High-Level Module (zᴴ): The Planner

- Updates slowly, once per reasoning cycle.
- Responsible for abstract reasoning, strategy, and long-term planning.
- Provides a stable, global context (a high-dimensional latent state vector) that guides the lower-level computation.

You can think of this module as deciding what kind of solution strategy should be pursued next.

Low-Level Module (zᴸ): The Executor

- Updates rapidly.
- Handles detailed, local computation such as search, constraint propagation, or refinement.
- Runs for T internal steps for every single update of the high-level module.
- Performs intensive computation while the high-level state remains fixed.

This is the part that does the "heavy lifting" within a given plan.
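To make the two-timescale structure concrete, here is a minimal PyTorch sketch, not the authors' implementation. The class name HRMSketch, the hyperparameters, and the combination of states by simple summation are illustrative assumptions; plain nn.TransformerEncoderLayer blocks stand in for the encoder-only blocks with rotary positional encoding, RMSNorm, and gated feed-forward layers described above.

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Illustrative two-timescale recurrent loop; names and sizes are assumptions."""
    def __init__(self, vocab_size=10, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # input network: x -> x_bar
        # Stand-ins for the encoder-only Transformer blocks (the real model adds
        # rotary positional encoding, RMSNorm, and gated feed-forward layers).
        self.low = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)   # L-module
        self.high = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)  # H-module
        self.head = nn.Linear(d_model, vocab_size)       # output network

    def forward(self, x, n_cycles=2, t_steps=4):
        x_bar = self.embed(x)                   # latent working representation x_bar
        z_h = torch.zeros_like(x_bar)           # slow, high-level state z_H
        z_l = torch.zeros_like(x_bar)           # fast, low-level state z_L
        for _ in range(n_cycles):               # outer loop: one H-update per cycle
            for _ in range(t_steps):            # inner loop: T L-updates per cycle
                # L-module: conditioned on its own state, the plan z_H, and the input
                # (summation here is a simplification of how the states are combined)
                z_l = self.low(z_l + z_h + x_bar)
            z_h = self.high(z_h + z_l)          # H-module observes the final L-state, revises the plan
        return self.head(z_h)                   # decode the answer from the final z_H

# Example: an 81-cell Sudoku grid encoded as digits 0-9 (0 = blank cell).
logits = HRMSketch()(torch.randint(0, 10, (1, 81)))   # -> (1, 81, 10) per-cell predictions
```

The key design point the sketch tries to capture is the nesting: the L-module runs many fast updates inside each slow H-update, so total "thinking time" scales with the number of cycles rather than with network depth.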
The Interaction Loop (How Reasoning Actually Happens)

HRM reasoning unfolds through repeated hierarchical cycles:

1. Input Encoding. The input x is mapped into a latent working representation x̄ by an embedding network. For example, discrete tokens (or grid values such as Sudoku cells) are converted into continuous vectors. Note: the reasoning modules attend to this representation with full self-attention.

2. Low-Level Computation (Inner Loop). With the high-level state held constant, the low-level module iterates for T steps. At each step it updates based on its previous state, attends to the input representation, and is guided by the current high-level context. During these steps, the low-level module converges toward a local equilibrium, effectively performing search or refinement.

3. High-Level Update (Outer Loop). After T steps, the high-level module observes the final low-level state, updates its own state to reflect progress, and establishes a new global context.

4. Reset and Restart. The low-level module is now exposed to a new high-level state. This resets its convergence, allowing it to begin a fresh computational phase instead of stalling.

This process, called hierarchical convergence, allows HRM to perform deep, multi-stage reasoning while remaining stable and efficient.

Why This Matters Compared to Transformers + CoT

- Transformers rely on depth in space (layers) → HRM uses depth in time.
- Chain-of-thought externalizes reasoning into text → HRM keeps reasoning internal and continuous.
- Standard RNNs converge too early → HRM avoids this via hierarchical resets.
- Backpropagation Through Time is expensive → HRM uses a one-step gradient approximation with constant memory (a code sketch of this idea appears at the end of this section).

The result is a model that can execute algorithmic reasoning, search, and backtracking, all without generating intermediate reasoning text.

Mathematical Foundations

HRM's ability to reason deeply without exploding memory or training instability rests on two tightly connected mathematical ideas: (1) how computation unfolds over time, and (2) how gradients are computed without backpropagating through that entire history.

I. Hierarchical Convergence

A core problem with standard recurrent models is premature convergence. In a typical RNN, repeated application of the same update function causes the hidden state to quickly settle into a fixed point. Once this happens, updates become negligibly small, effectively halting computation: the model stops "thinking," even if more steps are allowed.

Note: premature convergence is distinct from the over-smoothing problem that recurrent models also face.

HRM avoids this failure mode through hierarchical convergence. How it works: during a reasoning cycle, the low-level module (L-module) repeatedly applies the same update function while the high-level state is held fixed. Under these conditions, the L-module naturally converges toward a local fixed point z_L^*, representing a locally consistent partial solution under the current strategy.

Crucially, once the cycle ends, the high-level module updates its state, changing the global context. This update reshapes the solution landscape, so the old fixed point is no longer valid. As a result, the low-level module is forced to diverge and recompute, beginning a new convergence process toward a different equilibrium.

This alternating pattern of local convergence followed by deliberate disruption allows HRM to sustain long-running, meaningful computation instead of collapsing early like standard RNNs.

II. Fixed Point Theorem & […]
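Because Backpropagation Through Time would require storing every intermediate state of the loop above, the one-step gradient approximation mentioned earlier keeps memory constant by backpropagating only through the final update of each module. Below is a minimal sketch of that idea under stated assumptions: the function name one_step_grad_forward, the argument layout, and the exact placement of the detached steps are illustrative, and f_low / f_high stand for the L- and H-module update functions rather than the paper's actual code.

```python
import torch

# Sketch of the one-step gradient approximation: everything except the final
# L- and H-updates runs under torch.no_grad(), so activation memory stays
# constant no matter how many reasoning steps are unrolled; gradients flow
# only through the last update of each module, approximating the gradient at
# the converged state.

def one_step_grad_forward(f_low, f_high, x_bar, z_l, z_h, n_cycles=8, t_steps=8):
    with torch.no_grad():                          # unroll almost the whole trajectory gradient-free
        for cycle in range(n_cycles):
            for step in range(t_steps):
                if cycle == n_cycles - 1 and step == t_steps - 1:
                    break                          # leave the very last L-step for autograd
                z_l = f_low(z_l, z_h, x_bar)
            if cycle < n_cycles - 1:
                z_h = f_high(z_h, z_l)
    z_l = f_low(z_l, z_h, x_bar)                   # final L-update, tracked by autograd
    z_h = f_high(z_h, z_l)                         # final H-update, tracked by autograd
    return z_h, z_l

# Hypothetical wiring with the modules from the earlier sketch:
#   f_low  = lambda z_l, z_h, x: model.low(z_l + z_h + x)
#   f_high = lambda z_h, z_l:    model.high(z_h + z_l)
```

The trade-off is that the gradient ignores how earlier iterations shaped the states, which is justified when the modules have (approximately) converged to their fixed points, which is exactly the regime that hierarchical convergence is designed to produce.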
