[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance

Hello everyone. I trained Qwen2.5-1.5B-Instruct with RLVR and SFT on the GSM8K dataset. RLVR boosted math-reasoning accuracy by 11.9 points, while SFT degraded it by 15.2 points.

SFT (Supervised Fine-tuning): Standard next-token prediction training on labeled data.
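To make the objective concrete, here is a minimal sketch of the next-token cross-entropy loss that SFT minimizes, in plain Python with toy made-up probabilities standing in for a softmax output (in a real trainer this is computed over logits by the framework):

```python
import math

def sft_loss(token_probs, target_ids):
    """Average negative log-likelihood of the target tokens.

    token_probs: list of dicts mapping token id -> model probability
                 at each position (toy stand-in for a softmax output).
    target_ids:  the ground-truth next token at each position.
    """
    nll = [-math.log(probs[t]) for probs, t in zip(token_probs, target_ids)]
    return sum(nll) / len(nll)

# Toy example: two positions where the model assigns the target
# tokens probability 0.5 and 0.25 respectively.
probs = [{7: 0.5, 3: 0.5}, {9: 0.25, 2: 0.75}]
loss = sft_loss(probs, [7, 9])  # -(ln 0.5 + ln 0.25) / 2 ≈ 1.0397
```

The key point: the loss rewards reproducing the reference tokens exactly, which is why SFT can teach surface patterns without improving the underlying reasoning.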

RLVR (Reinforcement Learning with Verifiable Rewards): The training approach behind DeepSeek-R1. The model is reinforced to produce responses that earn higher rewards from a verifiable signal (e.g., correct math answers). This is what enabled models to generate their own chain-of-thought reasoning and led to dramatic improvements in reasoning and agentic tasks.
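The "verifiable" part just means the reward is computable by a checker rather than a learned reward model. A sketch of such a reward for GSM8K, whose reference answers end in a `#### <answer>` line (the regexes and fallback behavior here are my own simplification, not the author's exact code):

```python
import re

def extract_answer(text):
    """Pull the number from a '#### <answer>'-style final line,
    falling back to the last number anywhere in the text."""
    m = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    if m is not None:
        return m.group(1).replace(",", "")
    nums = re.findall(r"-?[\d,]*\.?\d+", text)
    if not nums:
        return None  # the "no-answer" case
    return nums[-1].replace(",", "")

def reward(completion, gold):
    """Binary verifiable reward: 1.0 iff the extracted answers match."""
    pred = extract_answer(completion)
    return 1.0 if pred is not None and pred == extract_answer(gold) else 0.0

reward("... so she earns 3*6=18. #### 18", "#### 18")  # -> 1.0
reward("I am not sure.", "#### 18")                     # -> 0.0
```

Because the signal only checks the final answer, the model is free to discover whatever chain-of-thought gets it there.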

I ran three experiments:

  1. RLVR vs SFT on GSM8K train split: Standard training and comparison.
  2. Cheating analysis: Training directly on the GSM8K test set to measure data contamination effects.
  3. One-example RLVR: RLVR training with only a single example from two different data sources.
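For context on the GRPO runs above: GRPO samples a group of completions per prompt and scores each one against the group's own mean and standard deviation instead of a learned value baseline. A minimal sketch of that advantage computation (the `eps` stabilizer is my own detail):

```python
import math

def grpo_advantages(rewards, eps=1e-4):
    """Group-relative advantages: normalize each sampled completion's
    reward by the mean and std of its own group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# With binary verifiable rewards, correct samples get positive advantage
# and incorrect ones negative:
grpo_advantages([1.0, 0.0, 0.0, 1.0])  # -> approx [1.0, -1.0, -1.0, 1.0]
```

These advantages then weight the policy-gradient update, reinforcing the completions that verified as correct.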

Results:

RLVR training significantly improves GSM8K performance and also lifts scores on the unrelated MATH benchmark, suggesting a general reasoning improvement, even when training on only a single example.

SFT degrades performance significantly on both benchmarks, regardless of whether it is trained on the train or test split. SFT appears to override the model's pretrained knowledge, making it mimic surface patterns without actually improving reasoning ability. Notably, SFT does reduce the no-answer rate, meaning the model learns to produce answers in the expected format, but the answers themselves are less accurate.

See the training progression plots and results table above.

GPU whirring that went into this project:

| Experiment | GPUs | Duration | Epochs |
|---|---|---|---|
| GRPO GSM8K Train | 6× RTX 4090 | 32h 12m | 13 |
| GRPO GSM8K Test | 8× RTX 3090 | 20h 09m | 30 |
| GRPO GSM8K 1-Example | 8× RTX 3090 | 11h 16m | |
| GRPO DSR 1-Example | 8× RTX 3090 | 12h 43m | |
| SFT GSM8K Train | 1× RTX 5090 | 2h 46m | 7 |
| SFT GSM8K Test | 1× RTX 5090 | 1h 06m | 15 |
| Benchmarking 388 Checkpoints | 1× RTX 5090 | 17h 41m | |

388 checkpoints were benchmarked for this project. Every prompt, model response, and extracted answer across all benchmarks is logged in a SQLite database (over 2.4 million rows), viewable live on Hugging Face Spaces via Datasette!

https://huggingface.co/spaces/jayminban/RLVR-vs-SFT-Qwen2.5-1.5b
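To give a feel for what a log like this supports, here is a self-contained sketch using Python's stdlib `sqlite3`. The schema and column names below are entirely hypothetical (the real database on the Space defines its own); the point is the kind of per-benchmark accuracy and no-answer-rate query Datasette can serve:

```python
import sqlite3

# Hypothetical schema; the actual database defines its own columns.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results (
    checkpoint TEXT, benchmark TEXT, prompt TEXT,
    response TEXT, extracted TEXT, correct INTEGER)""")
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
    [("step-100", "gsm8k", "Q1", "...", "18", 1),
     ("step-100", "gsm8k", "Q2", "...", "7", 0),
     ("step-100", "math", "Q3", "...", None, 0)],  # NULL = no answer extracted
)

# Accuracy and no-answer rate per benchmark.
rows = con.execute("""
    SELECT benchmark,
           AVG(correct) AS accuracy,
           AVG(extracted IS NULL) AS no_answer_rate
    FROM results
    GROUP BY benchmark""").fetchall()
```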

For detailed analysis, all plots, training code, data, checkpoints, and more, check out the full project on GitHub.

https://github.com/jayminban/RLVR-vs-SFT-Qwen2.5-1.5b

Any feedback or ideas for my next project are greatly appreciated!

submitted by /u/jayminban
