Weak-Driven Learning: Your discarded checkpoints can make your strong models stronger

We just released a paper with a finding that surprised us during our own training runs: weaker, earlier checkpoints of a model can actually drive further improvement in a strong model that has already saturated under standard SFT.

The conventional wisdom is clear — weak models give you weak signal. Knowledge distillation flows from strong teacher to weak student. We found the opposite direction works too, and for a different reason.

The problem we noticed: Once a model becomes highly confident during post-training, logits for both correct and incorrect tokens plateau. Gradients effectively vanish. You keep training, but the model stops meaningfully improving. We call this the saturation bottleneck.
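To make the bottleneck concrete, here is a minimal numpy sketch (toy logits, not taken from the paper) of how the cross-entropy gradient collapses once the correct-token logit dominates:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad(z, target):
    # d(cross-entropy)/d(logits) = softmax(z) - one_hot(target)
    g = softmax(z)
    g[target] -= 1.0
    return g

# moderately confident model: the gradient still carries signal
z_mid = np.array([2.0, 0.5, 0.0])
# saturated model: the correct-token logit dominates by a wide margin
z_sat = np.array([12.0, 0.5, 0.0])

g_mid = np.linalg.norm(ce_grad(z_mid, target=0))
g_sat = np.linalg.norm(ce_grad(z_sat, target=0))
print(f"gradient norm, moderate : {g_mid:.4f}")
print(f"gradient norm, saturated: {g_sat:.6f}")  # effectively zero
```

Training longer on the saturated example moves the weights almost not at all, even though nearby decision boundaries may still be wrong.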

The counterintuitive fix: Instead of seeking a better teacher, we mix in logits from a *weaker* checkpoint of the model itself. The weak model’s less-confident, noisier predictions re-expose decision boundaries that the strong model has over-compressed. This amplifies informative gradients precisely where standard SFT has gone flat.
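A convex combination is one natural reading of "logit mixing"; the exact form and the weight `alpha` below are illustrative assumptions on our part, not the paper's specification. The point it demonstrates: mixing in the weak checkpoint re-opens the collapsed output distribution.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mix_logits(strong, weak, alpha=0.3):
    # alpha = weight on the weak checkpoint (illustrative value)
    return (1.0 - alpha) * strong + alpha * weak

strong = np.array([12.0, 0.5, 0.0])  # saturated, over-confident checkpoint
weak   = np.array([1.2, 0.8, 0.4])   # earlier, less-confident checkpoint

p_strong = softmax(strong)
p_mixed  = softmax(mix_logits(strong, weak))
# mixing returns probability mass to the competing tokens,
# so the decision boundary becomes visible to the loss again
print("strong:", p_strong.round(6))
print("mixed :", p_mixed.round(6))
```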

How it works (WMSS — three phases):

  1. Train a base model with SFT → that’s your strong model. The original base becomes your weak reference.

  2. Use entropy dynamics between weak and strong to build a curriculum that focuses on samples with recoverable learning gaps.

  3. Jointly train via logit mixing — the weak model’s uncertainty forces the strong model to keep refining rather than coasting.
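Phase 2 might be sketched as follows. The gap statistic H(weak) − H(strong) and the top-k selection are our illustrative guess at what "recoverable learning gaps" means, not the paper's exact criterion:

```python
import numpy as np

def entropy(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def select_curriculum(weak_logits, strong_logits, k):
    """Rank samples by H(weak) - H(strong): where the weak checkpoint is
    still uncertain but the strong model is confident, the strong model
    may have over-compressed a decision it could still refine."""
    gap = entropy(weak_logits) - entropy(strong_logits)
    return np.argsort(-gap)[:k]

# three toy samples over a 3-token vocabulary
weak = np.array([[1.0, 0.9, 0.8],    # weak very uncertain
                 [5.0, 0.1, 0.0],    # both models confident
                 [2.0, 1.5, 0.2]])   # weak somewhat uncertain
strong = np.array([[9.0, 0.2, 0.1],
                   [9.5, 0.1, 0.0],
                   [8.0, 0.4, 0.1]])

picked = select_curriculum(weak, strong, k=2)
print(picked)  # samples with the largest recoverable gap
```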

Results: Consistent improvements on math reasoning (including AIME2025) and code generation over standard SFT baselines using Qwen3-4B-Base. Zero additional inference cost — the weak model is only used during training.

We also provide a gradient-level theoretical analysis showing why this works: the mixed logits reshape the loss landscape and prevent the Hessian contraction that causes gradient shielding in saturated regimes.
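The gradient-level effect can be checked numerically. Assuming a convex mix z_mix = (1 − α)·z_strong + α·z_weak (again our assumed form), the chain rule gives ∂L/∂z_strong = (1 − α)·(softmax(z_mix) − one_hot), and in a saturated regime this is far larger than the plain SFT gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_norm(p, target):
    g = p.copy()
    g[target] -= 1.0
    return np.linalg.norm(g)

strong = np.array([8.0, 0.5, 0.0])  # saturated checkpoint
weak   = np.array([1.2, 0.8, 0.4])  # earlier checkpoint
alpha  = 0.5                        # illustrative mixing weight

# plain SFT gradient w.r.t. the strong logits
g_sft = grad_norm(softmax(strong), target=0)

# mixed-logit loss gradient w.r.t. the strong logits:
# (1 - alpha) * (softmax(z_mix) - one_hot)
z_mix = (1 - alpha) * strong + alpha * weak
g_mix = (1 - alpha) * grad_norm(softmax(z_mix), target=0)

print(f"SFT gradient  : {g_sft:.5f}")
print(f"mixed gradient: {g_mix:.5f}")  # >10x larger on this toy example
```

Even after the (1 − α) attenuation, the mixed gradient dominates, because the mixed distribution is far from its target while the saturated one is numerically on top of it.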

The broader takeaway that excites us: the “waste” of training — those intermediate checkpoints you’d normally throw away — contains structured error signal that can push your final model further. No need for a bigger teacher. Your model’s own past is enough.

Paper: https://arxiv.org/abs/2602.08222

Code: https://github.com/chenzehao82/Weak-Driven-Learning

submitted by /u/This_Ad9834
