Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch – updates! [P]
So, yesterday's run was a success: the average rollout length came out to about 64 tokens, as shown in the attached image! This was with quality_reward + length_penalty (more info below!). Next, I'll rerun with the length penalty as the only reward, with the bug of counting characters instead of tokens fixed, and see whether the model games the reward or the outputs degrade. I used two rewards: quality_reward and length_penalty.
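Not from the post, but a minimal sketch of what a token-based length penalty could look like; the function name, target length, and scale are my own illustrative choices. The key point is that it counts tokens (e.g. tokenizer output ids), not characters, which is the bug mentioned above:

```python
# Hypothetical sketch of a length-penalty reward. All names and
# constants here are illustrative, not the author's actual code.

def length_penalty_reward(token_ids, target_len=64, scale=0.01):
    """Reward is 0 at exactly target_len tokens and decreases
    linearly the further the rollout length drifts from it.

    token_ids: list of token ids from the tokenizer, NOT raw text --
    len(text) over characters would inflate the count roughly 4x
    for English and make the penalty far too aggressive.
    """
    return -scale * abs(len(token_ids) - target_len)
```

In a real rollout loop, `token_ids` would come from the vLLM completion (or from re-tokenizing the completion text), and this reward would be summed with the quality reward per rollout.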
One node drives training with GRPO; two others push rollouts via vLLM. I trained two variants:
Eval: LLM-as-a-Judge (gpt-5)
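For anyone curious what the GRPO part of a from-scratch setup involves: the core trick is computing advantages by normalizing each prompt's group of rollout rewards, no value network needed. A minimal sketch (my own, assuming rewards are already collected per prompt group; not the author's code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO group-relative advantages.

    rewards: (num_prompts, group_size) tensor, one scalar reward per
    rollout. Each prompt's group is normalized to zero mean / unit std,
    so a rollout is rewarded only relative to its siblings from the
    same prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

These per-rollout advantages are then broadcast over that rollout's tokens in the clipped policy-gradient loss.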
submitted by /u/East-Muffin-6472