Nvidia: End-to-End Test-Time Training for Long Context aka Being Able To Update A Model’s Weights In Real-Time As You Use It | “TTT changes the paradigm from retrieving info to learning it on the fly…the TTT model treats the context window as a dataset & trains itself on it in real-time.” [R]
TL;DR: The paper describes a mechanism that essentially turns the context window into a training dataset for a “fast weight” update loop:
From the Paper’s Abstract: “Overall, our empirical observations strongly indicate that TTT-E2E should produce the same trend as full attention for scaling with training compute in large-budget production runs.”
Layman’s Explanation: Think of this paper as solving the memory bottleneck by fundamentally changing how a model processes information. Imagine you are taking a massive open-book exam. A standard Transformer (like GPT-4) is the student who frantically re-reads every single page of the textbook before answering every single question. This strategy guarantees they find the specific details (perfect recall), but as the textbook gets thicker, they get quadratically slower until they simply cannot finish the test in time.

On the other hand, alternatives like RNNs or Mamba try to summarize the entire textbook onto a single index card. They can answer questions instantly because they don’t have to look back at the book, but for long, complex subjects, they eventually run out of space on the card and start forgetting crucial information.

This new method, Test-Time Training (TTT), changes the paradigm from retrieving information to learning it on the fly. Instead of re-reading the book or summarizing it onto a card, the TTT model treats the context window as a dataset and actually trains itself on it in real time: it performs mini gradient-descent updates on its own neural weights as it reads (a toy sketch of this inner loop appears after the links below). This is equivalent to a student who reads the textbook and physically rewires their brain to master the subject matter before the test. Because the information is compressed into the model’s actual intelligence (its weights) rather than a temporary cache, the model can answer questions instantly, matching the constant per-token speed of the fast index-card models, but with the high accuracy and scaling capability of the slow, page-turning Transformers. This effectively decouples intelligence from memory costs, allowing for massive context lengths without the usual slowdown.

Link to the Paper: https://arxiv.org/pdf/2512.23675

Link to the Open-Sourced Official Implementation of End-to-End Test-Time Training for Long Context: https://github.com/test-time-training/e2e
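To make the “train on the context” idea concrete, here is a minimal sketch of a fast-weight inner loop in the spirit of TTT. This is an illustration, not the paper’s actual TTT-E2E architecture: the linear fast-weight model, the key/value reconstruction loss, and hyperparameters like `inner_lr` and `chunk_size` are all simplifying assumptions made up for this example.

```python
import numpy as np

# Hypothetical minimal sketch of the test-time-training idea described above.
# A "fast weight" matrix W_fast is updated by gradient descent on each incoming
# chunk of context, compressing that chunk into the weights instead of a cache.
# Dimensions, learning rate, and the reconstruction objective are illustrative
# assumptions, not the paper's architecture.

rng = np.random.default_rng(0)
d_model, inner_lr, chunk_size = 64, 0.1, 16

# Frozen "slow" projections (in a real model, learned during pre-training).
W_K = rng.normal(scale=d_model**-0.5, size=(d_model, d_model))
W_V = rng.normal(scale=d_model**-0.5, size=(d_model, d_model))
W_Q = rng.normal(scale=d_model**-0.5, size=(d_model, d_model))

# Fast weights: the per-sequence "memory" that gets trained at inference time.
W_fast = np.zeros((d_model, d_model))

def ttt_step(W_fast, chunk):
    """One inner-loop update: fit W_fast so that key projections of the
    chunk reconstruct its value projections (a self-supervised loss)."""
    K, V = chunk @ W_K, chunk @ W_V
    err = K @ W_fast - V               # residual of the reconstruction
    grad = K.T @ err / len(chunk)      # gradient of 0.5 * ||K W - V||^2
    return W_fast - inner_lr * grad    # one SGD step on the fast weights

def ttt_read(W_fast, chunk):
    """Produce outputs by querying the trained fast weights."""
    return (chunk @ W_Q) @ W_fast

# Stream a long "context" through the layer: the cost per chunk is constant,
# no matter how many chunks came before (unlike full attention, whose cost
# grows with everything already read).
context = rng.normal(size=(1024, d_model))
for chunk in context.reshape(-1, chunk_size, d_model):
    W_fast = ttt_step(W_fast, chunk)   # "learn" the chunk into the weights
    out = ttt_read(W_fast, chunk)      # answer using the updated memory
```

The point of the sketch is the shape of the computation: memory of the context lives in `W_fast`, which stays a fixed size, so reading 10x more context costs 10x more update steps but never inflates per-step cost, which is the constant-speed property the analogy above attributes to TTT.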