[P] I trained Qwen2.5-1.5B with RLVR (GRPO) vs. SFT and compared benchmark performance
Hello everyone. I trained Qwen2.5-1.5B-Instruct with RLVR and SFT on the GSM8K dataset. RLVR boosted math reasoning by +11.9 points; SFT degraded it by -15.2.

Quick definitions:

- **SFT (Supervised Fine-Tuning):** standard next-token prediction training on labeled data.
- **RLVR (Reinforcement Learning with Verifiable Rewards):** the training approach behind DeepSeek-R1. The model is reinforced to produce responses that earn higher rewards from a verifiable signal (e.g., correct math answers). This is what enabled models to generate their own chain-of-thought reasoning and led to dramatic improvements in reasoning and agentic tasks.

I ran three experiments:
Results:

- RLVR training significantly improves GSM8K performance while also improving unrelated MATH scores, suggesting a general reasoning improvement, even when training on only one example.
- SFT degrades performance significantly on both benchmarks, regardless of train or test data. SFT appears to override the model's pretrained knowledge, making it mimic surface patterns without actually improving reasoning ability.
- Notably, SFT does reduce the no-answer rate, meaning the model learns to produce answers in the expected format, but the answers themselves are less accurate.

See the training progression plots and results table above.

GPU whirring that went into this project:
388 checkpoints were benchmarked for this project. Every prompt, model response, and extracted answer across all benchmarks is logged in a SQLite database (over 2.4 million rows), viewable live on Hugging Face Spaces via Datasette: https://huggingface.co/spaces/jayminban/RLVR-vs-SFT-Qwen2.5-1.5b

For detailed analysis, all plots, training code, data, checkpoints, and more, check out the full project on GitHub: https://github.com/jayminban/RLVR-vs-SFT-Qwen2.5-1.5b

Any feedback or ideas for my next project are greatly appreciated!

submitted by /u/jayminban