Training Qwen2.5-0.5B-Instruct on Reddit post summarization with GRPO on my 3x Mac Minis – using a combination of quality rewards
Trying a combination of quality rewards with a length penalty! With this project I want to see whether length-constrained (only 64 tokens!) quality summarization can be done by tiny LLMs using GRPO. Why a combination of quality rewards?
Setup: 3x Mac Minis in a cluster running MLX. One node drives training with GRPO; the other two push rollouts via vLLM. Trained two variants:
→ length penalty only (baseline)
→ length penalty + quality reward (BLEU, METEOR and/or ROUGE-L)
Now, with the length penalty kept fixed throughout, none of the above quality metrics seemed to increase as training proceeded. So I thought maybe the length penalty paired with each of the above metrics is just fighting the strict 64-token limit I set (since the ground-truth summaries were comparatively quite short – more details soon!). So basically, I'll be doing:
Eval: LLM-as-a-Judge (gpt-5). Used DeepEval to build a judge pipeline scoring each summary on 4 axes:
Models + eval artifacts are on HuggingFace. Next: t-tests on combination rewards!
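To make the reward combination concrete, here is a minimal sketch of a length-penalized quality reward, using ROUGE-L (via longest common subsequence) as the quality term. The function names, the linear penalty shape, whitespace tokenization, and the `alpha` weight are all my illustrative assumptions, not the post's actual implementation:

```python
# Hedged sketch: combined reward = quality (ROUGE-L F1 vs. reference)
# plus a penalty for exceeding the 64-token budget.

def lcs_len(a, b):
    # classic DP longest-common-subsequence length over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(summary, reference):
    s, r = summary.split(), reference.split()
    if not s or not r:
        return 0.0
    lcs = lcs_len(s, r)
    prec, rec = lcs / len(s), lcs / len(r)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def length_penalty(summary, budget=64):
    # zero inside the budget, linearly negative beyond it
    # (whitespace splitting is a stand-in for real tokenizer counts)
    n = len(summary.split())
    return 0.0 if n <= budget else -(n - budget) / budget

def combined_reward(summary, reference, alpha=1.0, budget=64):
    # alpha trades off quality vs. length; alpha=1.0 is an arbitrary default
    return alpha * rouge_l_f1(summary, reference) + length_penalty(summary, budget)
```

Swapping `rouge_l_f1` for BLEU or METEOR (or an average of several) gives the other reward combinations the post describes.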
submitted by /u/East-Muffin-6472