Training Qwen2.5-0.5B-Instruct on Reddit post summarization with GRPO on my 3x Mac Minis — trying a combination of quality rewards with a length penalty!

So, with this project I want to see if length-constrained (like 64 tokens only) quality summarization can be done by tiny LLMs using GRPO!

Why a combination of quality rewards? ROUGE-L only cares about the longest common subsequence — it misses synonyms and paraphrases entirely. METEOR handles both: it aligns tokens with synonym matching via […]
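To make the reward idea concrete, here's a minimal sketch of what a combined reward function could look like. This is an illustration, not the project's actual code: the ROUGE-L F1 is implemented from scratch via LCS, METEOR is left out to keep the snippet self-contained (it needs WordNet for synonym alignment), and the linear length-penalty shape, the 64-token cap, and the `w_quality` weight are all my assumptions.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l_f1(summary: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    s, r = summary.split(), reference.split()
    if not s or not r:
        return 0.0
    lcs = lcs_len(s, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(s), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def length_penalty(summary: str, max_tokens: int = 64) -> float:
    """1.0 within budget, then a linear ramp down to 0 (assumed shape)."""
    n = len(summary.split())
    if n <= max_tokens:
        return 1.0
    return max(0.0, 1.0 - (n - max_tokens) / max_tokens)

def combined_reward(summary: str, reference: str, w_quality: float = 1.0) -> float:
    """Quality score gated by the length penalty (hypothetical combination)."""
    return w_quality * rouge_l_f1(summary, reference) * length_penalty(summary)
```

In a GRPO loop, each sampled completion in a group would get scored by `combined_reward` against the reference summary, and the group-normalized scores become the advantages. Multiplying (rather than adding) the penalty means an over-length summary can't buy back reward with quality alone, which matches the "64 tokens only" constraint.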