[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance

Hello everyone. I trained Qwen2.5-1.5B-Instruct with RLVR and SFT on the GSM8K dataset. RLVR boosted math-reasoning accuracy by 11.9 points, while SFT degraded it by 15.2 points.

SFT (Supervised Fine-tuning): Standard next-token prediction training on labeled data.
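To make the objective concrete, here is a minimal sketch of the next-token cross-entropy loss that SFT minimizes, in plain Python with toy made-up probabilities standing in for a softmax output (in a real trainer this is computed over logits by the framework):

```python
import math

def sft_loss(token_probs, target_ids):
    """Average negative log-likelihood of the target tokens.

    token_probs: list of dicts mapping token id -> model probability
                 at each position (toy stand-in for a softmax output).
    target_ids:  the ground-truth next token at each position.
    """
    nll = [-math.log(probs[t]) for probs, t in zip(token_probs, target_ids)]
    return sum(nll) / len(nll)

# Toy example: two positions where the model assigns the target
# tokens probability 0.5 and 0.25 respectively.
probs = [{7: 0.5, 3: 0.5}, {9: 0.25, 2: 0.75}]
loss = sft_loss(probs, [7, 9])  # -(ln 0.5 + ln 0.25) / 2 ≈ 1.0397
```

The key point: the loss rewards reproducing the reference tokens exactly, which is why SFT can teach surface patterns without improving the underlying reasoning.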

RLVR (Reinforcement Learning with Verifiable Rewards): The training approach behind DeepSeek-R1. The model is reinforced to produce responses that earn higher rewards from a verifiable signal (e.g., correct math answers). This is what enabled models to generate their own chain-of-thought reasoning and led to dramatic improvements in reasoning and agentic tasks.
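The "verifiable" part just means the reward is computable by a checker rather than a learned reward model. A sketch of such a reward for GSM8K, whose reference answers end in a `#### <answer>` line (the regexes and fallback behavior here are my own simplification, not the author's exact code):

```python
import re

def extract_answer(text):
    """Pull the number from a '#### <answer>'-style final line,
    falling back to the last number anywhere in the text."""
    m = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    if m is not None:
        return m.group(1).replace(",", "")
    nums = re.findall(r"-?[\d,]*\.?\d+", text)
    if not nums:
        return None  # the "no-answer" case
    return nums[-1].replace(",", "")

def reward(completion, gold):
    """Binary verifiable reward: 1.0 iff the extracted answers match."""
    pred = extract_answer(completion)
    return 1.0 if pred is not None and pred == extract_answer(gold) else 0.0

reward("... so she earns 3*6=18. #### 18", "#### 18")  # -> 1.0
reward("I am not sure.", "#### 18")                     # -> 0.0
```

Because the signal only checks the final answer, the model is free to discover whatever chain-of-thought gets it there.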

I ran three experiments:

  1. RLVR vs SFT on GSM8K train split: Standard training and comparison.
  2. Cheating analysis: Training directly on the GSM8K test set to measure data contamination effects.
  3. One-example RLVR: RLVR training with only a single example from two different data sources.
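For context on the GRPO runs above: GRPO samples a group of completions per prompt and scores each one against the group's own mean and standard deviation instead of a learned value baseline. A minimal sketch of that advantage computation (the `eps` stabilizer is my own detail):

```python
import math

def grpo_advantages(rewards, eps=1e-4):
    """Group-relative advantages: normalize each sampled completion's
    reward by the mean and std of its own group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# With binary verifiable rewards, correct samples get positive advantage
# and incorrect ones negative:
grpo_advantages([1.0, 0.0, 0.0, 1.0])  # -> approx [1.0, -1.0, -1.0, 1.0]
```

These advantages then weight the policy-gradient update, reinforcing the completions that verified as correct.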

Results:

RLVR training significantly improves GSM8K performance and also lifts scores on the unrelated MATH benchmark, suggesting a general reasoning improvement, even when training on only a single example.

SFT degrades performance significantly on both benchmarks, regardless of whether it is trained on the train or test split. SFT appears to override the model's pretrained knowledge, making it mimic surface patterns without actually improving reasoning ability. Notably, SFT does reduce the no-answer rate, meaning the model learns to produce answers in the expected format, but the answers themselves are less accurate.

See the training progression plots and results table above.

GPU whirring that went into this project:

| Experiment | GPUs | Duration | Epochs |
|---|---|---|---|
| GRPO GSM8K Train | 6× RTX 4090 | 32h 12m | 13 |
| GRPO GSM8K Test | 8× RTX 3090 | 20h 09m | 30 |
| GRPO GSM8K 1-Example | 8× RTX 3090 | 11h 16m | |
| GRPO DSR 1-Example | 8× RTX 3090 | 12h 43m | |
| SFT GSM8K Train | 1× RTX 5090 | 2h 46m | 7 |
| SFT GSM8K Test | 1× RTX 5090 | 1h 06m | 15 |
| Benchmarking 388 Checkpoints | 1× RTX 5090 | 17h 41m | |

388 checkpoints were benchmarked for this project. Every prompt, model response, and extracted answer across all benchmarks is logged in a SQLite database (over 2.4 million rows), viewable live on Hugging Face Spaces via Datasette!

https://huggingface.co/spaces/jayminban/RLVR-vs-SFT-Qwen2.5-1.5b
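To give a feel for what a log like this supports, here is a self-contained sketch using Python's stdlib `sqlite3`. The schema and column names below are entirely hypothetical (the real database on the Space defines its own); the point is the kind of per-benchmark accuracy and no-answer-rate query Datasette can serve:

```python
import sqlite3

# Hypothetical schema; the actual database defines its own columns.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results (
    checkpoint TEXT, benchmark TEXT, prompt TEXT,
    response TEXT, extracted TEXT, correct INTEGER)""")
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
    [("step-100", "gsm8k", "Q1", "...", "18", 1),
     ("step-100", "gsm8k", "Q2", "...", "7", 0),
     ("step-100", "math", "Q3", "...", None, 0)],  # NULL = no answer extracted
)

# Accuracy and no-answer rate per benchmark.
rows = con.execute("""
    SELECT benchmark,
           AVG(correct) AS accuracy,
           AVG(extracted IS NULL) AS no_answer_rate
    FROM results
    GROUP BY benchmark""").fetchall()
```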

For detailed analysis, all plots, training code, data, checkpoints, and more, check out the full project on GitHub.

https://github.com/jayminban/RLVR-vs-SFT-Qwen2.5-1.5b

Any feedback or ideas for my next project are greatly appreciated!

submitted by /u/jayminban
