Training Qwen2.5-0.5B-Instruct on Reddit post summarization with GRPO on my 3x Mac Minis – using a combination of quality rewards
Trying a combination of quality rewards with a length penalty! With this project I want to see whether length-constrained (only 64 tokens!) quality summarization can be done by tiny LLMs using GRPO. Why a combination of quality rewards?
Setup: 3x Mac Minis in a cluster running MLX. One node drives training with GRPO; the other two push rollouts via vLLM. Trained two variants:
→ length penalty only (baseline)
→ length penalty + quality reward (BLEU, METEOR and/or ROUGE-L)
Now, with the length penalty kept fixed throughout, none of the above quality metrics seemed to increase as training proceeded. So I thought maybe the length penalty paired with each of the above metrics is just fighting the strict 64-token limit I set (since the ground-truth summaries were comparatively quite short – more details soon!). So basically, I'll be doing:
Eval: LLM-as-a-Judge (gpt-5). Used DeepEval to build a judge pipeline scoring each summary on 4 axes:
Models + eval artifacts are on HuggingFace. Next: t-tests on combination rewards!
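To make the reward combination concrete, here is a minimal sketch of a length-penalized quality reward, using ROUGE-L (via longest common subsequence) as the quality term. The function names, the linear penalty shape, whitespace tokenization, and the `alpha` weight are all my illustrative assumptions, not the post's actual implementation:

```python
# Hedged sketch: combined reward = quality (ROUGE-L F1 vs. reference)
# plus a penalty for exceeding the 64-token budget.

def lcs_len(a, b):
    # classic DP longest-common-subsequence length over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(summary, reference):
    s, r = summary.split(), reference.split()
    if not s or not r:
        return 0.0
    lcs = lcs_len(s, r)
    prec, rec = lcs / len(s), lcs / len(r)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def length_penalty(summary, budget=64):
    # zero inside the budget, linearly negative beyond it
    # (whitespace splitting is a stand-in for real tokenizer counts)
    n = len(summary.split())
    return 0.0 if n <= budget else -(n - budget) / budget

def combined_reward(summary, reference, alpha=1.0, budget=64):
    # alpha trades off quality vs. length; alpha=1.0 is an arbitrary default
    return alpha * rouge_l_f1(summary, reference) + length_penalty(summary, budget)
```

Swapping `rouge_l_f1` for BLEU or METEOR (or an average of several) gives the other reward combinations the post describes.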
submitted by /u/East-Muffin-6472