Training Qwen2.5-0.5B-Instruct on Reddit post summarization with GRPO on my 3x Mac Minis — added METEOR as a quality reward!

Setup: 3x Mac Minis in a cluster running MLX. One node drives training; the other two serve rollouts via vLLM.

Trained two variants:

  • length penalty only (baseline)
  • length penalty + quality reward (METEOR)
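
A rough sketch of the two variants; the target length, penalty shape, and 0.5/0.5 weighting below are illustrative assumptions, not the actual training code:

```python
# Sketch of the two reward variants (baseline vs. + quality).
# Target length, penalty shape, and the 0.5 weighting are assumptions.

def length_penalty(summary: str, target_tokens: int = 48) -> float:
    """Penalize summaries that overshoot a target length; 1.0 is best."""
    n = len(summary.split())
    overshoot = max(0, n - target_tokens)
    return max(0.0, 1.0 - overshoot / target_tokens)

def combined_reward(summary: str, reference: str, quality_fn=None) -> float:
    """Baseline: length penalty only. With quality_fn (e.g. METEOR
    against the reference summary), blend the two signals equally."""
    r = length_penalty(summary)
    if quality_fn is not None:
        r = 0.5 * r + 0.5 * quality_fn(summary, reference)
    return r
```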

Eval: LLM-as-a-Judge. I used DeepEval to build a judge pipeline that scores each summary on 4 axes:

  • Faithfulness — no hallucinations vs. source
  • Coverage — key points captured
  • Conciseness — shorter, no redundancy
  • Clarity — readable on its own
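
Schematically, the pipeline reduces to scoring each axis and aggregating. The 1–5 scale, equal weighting, and plain-dict interface here are assumptions for illustration; in the real pipeline each axis is a DeepEval metric backed by an LLM judge:

```python
# Illustrative shape of the 4-axis aggregation. The 1-5 scale and equal
# weights are assumptions; the per-axis scores would come from DeepEval
# LLM-judge metrics, not hand-entered numbers.

AXES = ("faithfulness", "coverage", "conciseness", "clarity")

def aggregate(scores: dict) -> float:
    """Average the per-axis judge scores (each expected in [1, 5])."""
    missing = set(AXES) - set(scores)
    if missing:
        raise ValueError(f"missing axes: {missing}")
    return sum(scores[a] for a in AXES) / len(AXES)
```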

Why METEOR in the quality reward?

  • ROUGE-L only cares about the longest common subsequence — it misses synonyms and paraphrases entirely.
  • METEOR handles both: it aligns tokens with synonym matching via WordNet and balances precision + recall with a chunk-order penalty.
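
The ROUGE-L failure mode is easy to see concretely. A minimal LCS-based ROUGE-L recall (no stemming, illustrative only) gives a paraphrase zero credit for the swapped word, whereas METEOR's WordNet alignment would match the synonyms:

```python
# ROUGE-L's core is the longest common subsequence over exact tokens,
# so a synonym contributes nothing. Minimal LCS sketch:

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = "the movie was fantastic".split()
hyp = "the movie was excellent".split()  # paraphrase of "fantastic"
recall = lcs_len(ref, hyp) / len(ref)    # 3/4 — the synonym scores zero
```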

(It’s also why there’s a threading lock around METEOR calls in the reward code — NLTK’s WordNet is not thread-safe.)
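
The locking pattern is roughly the following. The stub scorer is a stand-in for NLTK's non-thread-safe METEOR computation (which is what actually needs serializing); only the module-level lock is the point here:

```python
import threading

# WordNet lookups inside NLTK's METEOR are not thread-safe, so all
# calls are serialized behind one module-level lock. _unsafe_meteor is
# a stand-in for nltk.translate.meteor_score.meteor_score.

_METEOR_LOCK = threading.Lock()

def _unsafe_meteor(reference: str, hypothesis: str) -> float:
    # Stand-in scorer (token-set Jaccard) for the real computation.
    ref, hyp = set(reference.split()), set(hypothesis.split())
    return len(ref & hyp) / max(len(ref | hyp), 1)

def meteor_reward(reference: str, hypothesis: str) -> float:
    """Thread-safe wrapper: only one thread computes METEOR at a time."""
    with _METEOR_LOCK:
        return _unsafe_meteor(reference, hypothesis)
```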

Models + eval artifacts are on HuggingFace.


submitted by /u/East-Muffin-6472
