Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster
I just released a blog post on a side research project I have been working on for the past two months, and I would love for you all to check it out and tell me what you think!
The baseline zero-shot answer: not really. Zero-shot prompting yields composite G-Eval scores of 2.376 (Qwen) and 2.332 (LFM), with pass rates of just 21% and 13%, respectively. That was the starting point. I tested 12 reward configurations across 2 training strategies:
24 checkpoints total. One clear winner between the two strategies. The quality reward signals:
The staged curriculum wins – consistently. Best composite scores:
Practical takeaways:
The infra was the other fun part. Training runs on MLX (Apple Silicon, unified memory), and rollouts run on distributed vLLM workers via smolcluster. The two sides are asynchronous: while the trainer computes gradients for step N, vLLM is already generating rollouts for step N+1. Fitting full GRPO (policy + frozen ref model + activations + optimizer state) in 12 GB required chunked gradient accumulation, gradient checkpointing, and remote rollout generation. No LoRA, full bf16 parameters.

PS: All of this was done using the smolcluster framework I made, and it was really fun (and tiring) to train without OOMing! Let me know if you have any feedback or suggestions for where I should take this project next!

submitted by /u/East-Muffin-6472
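For readers curious about the trainer/rollout overlap described in the post, here is a minimal sketch of the general pattern using only the Python standard library. It is not the smolcluster implementation: `generate_rollouts` and `train_step` are hypothetical stand-ins for the remote vLLM call and the GRPO gradient step, and the one-slot buffer is an assumption chosen for illustration.

```python
import queue
import threading


def generate_rollouts(step):
    # Hypothetical stand-in for a request to remote rollout workers.
    return [f"rollout-{step}-{i}" for i in range(4)]


def train_step(step, rollouts):
    # Hypothetical stand-in for computing gradients and updating the policy.
    return f"step {step} trained on {len(rollouts)} rollouts"


def rollout_worker(num_steps, buffer):
    # Producer: generates batches in order; put() blocks while the buffer is full,
    # so generation cannot run far ahead of the trainer.
    for step in range(num_steps):
        buffer.put((step, generate_rollouts(step)))
    buffer.put(None)  # sentinel: no more batches


def run(num_steps=3):
    buffer = queue.Queue(maxsize=1)  # assumed: a single prefetched batch
    producer = threading.Thread(
        target=rollout_worker, args=(num_steps, buffer), daemon=True
    )
    producer.start()

    log = []
    while True:
        item = buffer.get()  # waits until the next batch is ready
        if item is None:
            break
        step, rollouts = item
        # While this call runs, the producer is already working on a later step.
        log.append(train_step(step, rollouts))
    producer.join()
    return log


if __name__ == "__main__":
    for line in run():
        print(line)
```

The bounded queue is the key design choice in this sketch: it lets generation for a later step overlap with training on the current one while capping how far ahead the generator can get.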