Training Qwen2.5-0.5B-Instruct on Reddit post summarization with GRPO on my 3x Mac Minis — added METEOR as a quality reward!

Setup: 3x Mac Minis in a cluster running MLX. One node drives training; the other two serve rollouts via vLLM.

Trained two variants:

  • length penalty only (baseline)
  • length penalty + quality reward (METEOR)
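
A rough sketch of the two variants; the target length, penalty shape, and 0.5/0.5 weighting below are illustrative assumptions, not the actual training code:

```python
# Sketch of the two reward variants (baseline vs. + quality).
# Target length, penalty shape, and the 0.5 weighting are assumptions.

def length_penalty(summary: str, target_tokens: int = 48) -> float:
    """Penalize summaries that overshoot a target length; 1.0 is best."""
    n = len(summary.split())
    overshoot = max(0, n - target_tokens)
    return max(0.0, 1.0 - overshoot / target_tokens)

def combined_reward(summary: str, reference: str, quality_fn=None) -> float:
    """Baseline: length penalty only. With quality_fn (e.g. METEOR
    against the reference summary), blend the two signals equally."""
    r = length_penalty(summary)
    if quality_fn is not None:
        r = 0.5 * r + 0.5 * quality_fn(summary, reference)
    return r
```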

Eval: LLM-as-a-Judge. I used DeepEval to build a judge pipeline that scores each summary on 4 axes:

  • Faithfulness — no hallucinations vs. source
  • Coverage — key points captured
  • Conciseness — shorter, no redundancy
  • Clarity — readable on its own
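
Schematically, the pipeline reduces to scoring each axis and aggregating. The 1–5 scale, equal weighting, and plain-dict interface here are assumptions for illustration; in the real pipeline each axis is a DeepEval metric backed by an LLM judge:

```python
# Illustrative shape of the 4-axis aggregation. The 1-5 scale and equal
# weights are assumptions; the per-axis scores would come from DeepEval
# LLM-judge metrics, not hand-entered numbers.

AXES = ("faithfulness", "coverage", "conciseness", "clarity")

def aggregate(scores: dict) -> float:
    """Average the per-axis judge scores (each expected in [1, 5])."""
    missing = set(AXES) - set(scores)
    if missing:
        raise ValueError(f"missing axes: {missing}")
    return sum(scores[a] for a in AXES) / len(AXES)
```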

Why METEOR in the quality reward?

  • ROUGE-L only cares about the longest common subsequence — it misses synonyms and paraphrases entirely.
  • METEOR handles both: it aligns tokens with synonym matching via WordNet and balances precision + recall with a chunk-order penalty.
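
The ROUGE-L failure mode is easy to see concretely. A minimal LCS-based ROUGE-L recall (no stemming, illustrative only) gives a paraphrase zero credit for the swapped word, whereas METEOR's WordNet alignment would match the synonyms:

```python
# ROUGE-L's core is the longest common subsequence over exact tokens,
# so a synonym contributes nothing. Minimal LCS sketch:

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = "the movie was fantastic".split()
hyp = "the movie was excellent".split()  # paraphrase of "fantastic"
recall = lcs_len(ref, hyp) / len(ref)    # 3/4 — the synonym scores zero
```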

(It’s also why there’s a threading lock around METEOR calls in the reward code — NLTK’s WordNet is not thread-safe.)
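
The locking pattern is roughly the following. The stub scorer is a stand-in for NLTK's non-thread-safe METEOR computation (which is what actually needs serializing); only the module-level lock is the point here:

```python
import threading

# WordNet lookups inside NLTK's METEOR are not thread-safe, so all
# calls are serialized behind one module-level lock. _unsafe_meteor is
# a stand-in for nltk.translate.meteor_score.meteor_score.

_METEOR_LOCK = threading.Lock()

def _unsafe_meteor(reference: str, hypothesis: str) -> float:
    # Stand-in scorer (token-set Jaccard) for the real computation.
    ref, hyp = set(reference.split()), set(hypothesis.split())
    return len(ref & hyp) / max(len(ref | hyp), 1)

def meteor_reward(reference: str, hypothesis: str) -> float:
    """Thread-safe wrapper: only one thread computes METEOR at a time."""
    with _METEOR_LOCK:
        return _unsafe_meteor(reference, hypothesis)
```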

Models + eval artifacts are on HuggingFace.


submitted by /u/East-Muffin-6472
