Reason Tuning Qwen2.5-0.5B-Instruct on the GSM8K dataset using GRPO written from scratch
So, I have been trying to reason tune a Qwen2.5-0.5B-Instruct model on the GSM8K math dataset on my Mac mini cluster for some time, using GRPO I wrote from scratch. It's just reward hacking. Why? Because the correct-answer reward signal is too shallow: it only rewards the model if the final answer is correct, with nothing in between. So I added a format reward so that the rewards, and thus the advantages, don't become near […]
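To make the sparse-reward problem concrete, here is a minimal sketch of group-relative advantages with and without a format reward. It is not the code from this post. The `<reasoning>`/`<answer>` tag layout, the 0.5 format weight, and all function names are assumptions made up for illustration.

```python
import re
import statistics

# Hypothetical output layout; the post does not say which tags are used.
FORMAT_RE = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def correctness_reward(completion: str, gold: str) -> float:
    """1.0 if the text inside <answer> matches the gold answer, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0


def format_reward(completion: str) -> float:
    """0.5 (an arbitrary weight) if the completion follows the assumed layout."""
    return 0.5 if FORMAT_RE.search(completion) else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled completions, none of them correct.
gold = "8"
completions = [
    "The answer is 7",
    "<reasoning>3 + 4 = 7</reasoning><answer>7</answer>",
    "<reasoning>3 * 5 = 15</reasoning><answer>15</answer>",
    "15",
]

# Correctness only: every reward is 0, so every advantage is 0 (no gradient).
sparse_rewards = [correctness_reward(c, gold) for c in completions]
sparse_adv = group_advantages(sparse_rewards)

# Correctness + format: the formatted samples now stand out from the group.
shaped_rewards = [
    correctness_reward(c, gold) + format_reward(c) for c in completions
]
shaped_adv = group_advantages(shaped_rewards)

print(sparse_adv)
print(shaped_adv)
```

In this group the only remaining signal comes from formatting, so the sketch also shows how a model can end up being pushed toward the format rather than toward the right answer.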