Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster

digitado ⋅ June 2, 2026

I just published a blog post on a side research project I have been working on for the past two months, and I would love for you all to check it out and tell me what you think!

It’s about output length-constrained summarization using LLMs with GRPO. All experiments run on tiny LLMs (Qwen2.5-0.5B-Instruct and LFM-2.5-350M) on a cluster of three Mac mini M4s (16 GB each), with single-node training and multi-node vLLM inference for rollouts.
The core question: can you teach a sub-500M model to summarize Reddit posts in exactly 64 tokens while keeping the quality high?
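For readers new to GRPO, here is a minimal sketch of the group-relative advantage it is built around: several summaries are sampled for the same post, and each one's reward is normalized against the rest of its group. The function name, the epsilon, and the use of the population standard deviation are assumptions for illustration, not details taken from the blog.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (same prompt).

    Assumption: population std plus a small epsilon; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical rollouts for one Reddit post, scored by some reward function.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
print(advantages)
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed.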

The baseline answer: not really. Zero-shot prompting gives composite G-Eval scores of 2.376 (Qwen) and 2.332 (LFM), with pass rates of just 21% and 13%.

That was the starting point.

I tested 12 reward configurations across 2 training strategies:

Strategy 1 – Length-Penalty Fine-tuned (or staged curriculum): Train on length reward first → checkpoint → fine-tune with quality rewards only.
Strategy 2 – Length-Penalty Included (a.k.a. joint): Length + quality rewards active simultaneously from step 1.
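A rough sketch of how the two strategies differ in what the reward function returns. The linear shape of `length_reward`, the unweighted sum in the joint case, and all names here are hypothetical; the post does not spell out the exact formulas.

```python
TARGET_TOKENS = 64  # the length target from the post

def length_reward(num_tokens, target=TARGET_TOKENS):
    """Hypothetical shape: 1.0 at exactly `target` tokens, decaying linearly to 0."""
    return max(0.0, 1.0 - abs(num_tokens - target) / target)

def total_reward(num_tokens, quality, strategy, stage=1):
    """Combine length and quality rewards under the two strategies.

    strategy: "staged" (length first, then quality only) or "joint".
    stage: only used by "staged"; 1 = length phase, 2 = quality phase.
    """
    if strategy == "joint":
        # Both signals are active from step 1 (equal weights assumed).
        return length_reward(num_tokens) + quality
    if strategy == "staged":
        # Phase 1 trains on length only; phase 2 resumes from that
        # checkpoint and uses quality rewards only.
        return length_reward(num_tokens) if stage == 1 else quality
    raise ValueError(f"unknown strategy: {strategy}")

print(total_reward(64, 0.25, "joint"))            # length + quality
print(total_reward(64, 0.25, "staged", stage=2))  # quality only
```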

24 checkpoints total. One clear winner between the two strategies.

The quality reward signals:

ROUGE-L – LCS F1 against the reference
METEOR – precision/recall with stemming + synonym matching
BLEU – n-gram precision with a brevity penalty

And all their pairwise combinations. Evaluated with G-Eval (LLM-as-judge) across Faithfulness, Coverage, Conciseness, and Clarity.
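For reference, the ROUGE-L signal in the list above (LCS F1 against the reference) can be sketched in plain Python. This is a simplified version that assumes lowercased whitespace tokenization and an unweighted F1; library implementations typically add stemming and other options, and the function names are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat lay on a mat"))
```

Because the score depends only on token order and overlap, it is cheap enough to compute for every rollout during training.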

The staged curriculum wins – consistently.

Best composite scores:

LFM: 2.904 (quality-meteor, fine-tuned) vs 2.701 (joint)
Qwen: 2.817 (quality-bleu-rouge, fine-tuned) vs 2.769 (joint)

Practical takeaways:

Staged curriculum (length first, quality second) outperforms joint training in absolute score
METEOR + ROUGE-L is the most reliable reward combination under both strategies
The length constraint is also a regularizer – it prevents the Coverage ↔ Conciseness collapse that happens when quality rewards run unconstrained
BLEU is not worth using as a standalone reward signal for summarization

The infra was the other fun part.

Training on MLX (Apple Silicon, unified memory). Rollouts on distributed vLLM workers via smolcluster. Asynchronous – while the trainer computes gradients for step N, vLLM is already generating rollouts for step N+1.
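The overlap described above can be pictured as a producer/consumer loop. This is a toy sketch using Python threads and a bounded queue as stand-ins for the vLLM workers and the MLX trainer; none of the names below come from smolcluster, and weight syncing back to the workers is left out.

```python
import queue
import threading

def rollout_worker(num_steps, out_queue):
    """Stand-in for the vLLM workers: produce one batch of rollouts per step."""
    for step in range(num_steps):
        rollouts = [f"step{step}-sample{i}" for i in range(4)]  # placeholder text
        out_queue.put((step, rollouts))  # blocks if the trainer falls behind
    out_queue.put(None)  # sentinel: no more rollouts

def train(num_steps):
    """Stand-in for the trainer: consume batches as they arrive."""
    buffer = queue.Queue(maxsize=1)  # at most one batch prefetched ahead
    producer = threading.Thread(target=rollout_worker, args=(num_steps, buffer))
    producer.start()
    trained = []
    while (item := buffer.get()) is not None:
        step, rollouts = item
        # The gradient update for `step` would run here; meanwhile the
        # producer thread is already generating rollouts for step + 1.
        trained.append((step, len(rollouts)))
    producer.join()
    return trained

print(train(3))
```

The bounded queue caps how far generation can run ahead of training, which limits how stale the rollouts get relative to the current weights.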

Fitting full GRPO (policy + frozen ref model + activations + optimizer state) in 12 GB required chunked gradient accumulation, gradient checkpointing, and remote rollout generation. No LoRA, full bf16 parameters.
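To illustrate the chunked gradient accumulation idea, here is a toy sketch with a scalar linear model in plain Python (not MLX, and not the actual training code). The point is only that processing a batch in small chunks and accumulating size-weighted gradients gives the same result as one large batch, while holding fewer activations in memory at once.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def chunked_grad(w, xs, ys, chunk_size):
    """Accumulate the same gradient chunk by chunk to cap peak memory."""
    total, n = 0.0, len(xs)
    for start in range(0, n, chunk_size):
        cx = xs[start:start + chunk_size]
        cy = ys[start:start + chunk_size]
        # Weight each chunk by its share of the batch so the sum
        # matches the full-batch mean gradient.
        total += grad_mse(w, cx, cy) * len(cx) / n
    return total

xs, ys = [1, 2, 3, 4, 5], [2, 4, 6, 8, 11]
print(grad_mse(0.5, xs, ys), chunked_grad(0.5, xs, ys, chunk_size=2))
```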

PS: All of this was done with the smolcluster framework I built. It was really fun (and tiring) to train without OOMing!

Blog

Let me know if you have any feedback or ideas for where I should take this project next!

submitted by /u/East-Muffin-6472