Training Qwen2.5-0.5B-Instruct on Reddit post summarization with GRPO on my 3x Mac Minis, adding METEOR as a quality reward!
**Setup:** 3x Mac Minis in a cluster running MLX. One node drives training, two push rollouts via vLLM.

Trained two variants:
- length penalty only (baseline)
- length penalty + quality reward (METEOR)

**Eval: LLM-as-a-Judge.** Used DeepEval to build a judge pipeline scoring each summary on 4 axes:
- Faithfulness: no hallucinations vs. the source
- Coverage: key points captured
- Conciseness: shorter, no redundancy
- Clarity: readable on its own

**Why METEOR in the quality reward?** ROUGE-L only cares […]
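To make the reward concrete, here is a minimal sketch of a METEOR-style quality term combined with a length penalty. It uses only exact unigram matching plus METEOR's fragmentation penalty; the full metric also matches stems and WordNet synonyms (e.g. via `nltk.translate.meteor_score`), and the `target_len` and weighting below are assumptions for illustration, not the values used in the run described here.

```python
def meteor_like(hyp, ref, alpha=0.9, beta=3.0, gamma=0.5):
    """Simplified METEOR: exact unigram matches + chunk-based fragmentation penalty."""
    if not hyp or not ref:
        return 0.0
    # Greedily align each hypothesis token to the earliest unused reference token.
    used, pairs = set(), []
    for i, tok in enumerate(hyp):
        for j, r in enumerate(ref):
            if j not in used and r == tok:
                used.add(j)
                pairs.append((i, j))
                break
    m = len(pairs)
    if m == 0:
        return 0.0
    # A "chunk" is a maximal run of matches contiguous in both strings.
    chunks, prev = 0, None
    for i, j in pairs:
        if prev is None or i != prev[0] + 1 or j != prev[1] + 1:
            chunks += 1
        prev = (i, j)
    precision, recall = m / len(hyp), m / len(ref)
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (chunks / m) ** beta
    return fmean * (1 - penalty)


def reward(summary, reference, target_len=48, quality_weight=1.0, length_weight=0.1):
    """Hypothetical combined GRPO reward: METEOR-style quality minus a linear
    over-length penalty. Weights and target length are illustrative."""
    hyp, ref = summary.lower().split(), reference.lower().split()
    quality = meteor_like(hyp, ref)
    over_length = max(0.0, len(hyp) - target_len) / target_len
    return quality_weight * quality - length_weight * over_length
```

A perfect-match summary scores near 1.0 (only the tiny fragmentation penalty applies), a disjoint one scores 0.0, and scrambled word order is penalized through the chunk count, which is exactly the behavior you want from the quality term.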