Training Qwen2.5-0.5B-Instruct on Reddit post summarization with GRPO on my 3x Mac Minis — add METEOR as quality reward!
Setup: 3x Mac Minis in a cluster running MLX. One node drives training; the other two push rollouts via vLLM. Trained two variants:
Eval: LLM-as-a-Judge. Used DeepEval to build a judge pipeline scoring each summary on 4 axes:
(It’s also why there’s a threading lock around METEOR calls in the reward code: NLTK’s WordNet is not thread-safe.) Models + eval artifacts are on HuggingFace.

submitted by /u/East-Muffin-6472