Practicing science communication on RL-for-reasoning: where does my explanation get the RL wrong?
Some background so you know where I’m coming from: I’m an AI researcher and RL/LLM reasoning was my PhD area. A while back I was asked to give a talk on how RL is used to induce reasoning in LLMs, and afterwards I tried to turn the dense version into a written explainer for a general but technical audience.
I’m trying to get better at science communication, so I’m posting here for the thing this sub is good at, which is telling me where I got the RL wrong or where an analogy smooths over something it shouldn’t.
Link: https://nicolobrandizzi.com/blog/rl-reasoning-llm/
What the post covers:
- RL 101 (state, action, reward) and how it differs from supervised learning
- GES (generate, evaluate, select) as a frame for reasoning
- process vs outcome supervision
- PPO and GRPO, with the advantage / baseline / value function / GAE progression
- the spurious-rewards result (random rewards still improving Qwen but hurting LLaMA, and what that implies about GRPO surfacing existing ability rather than teaching new reasoning)
- a more speculative closing section where I argue reasoning might be framed as recurrence, and that spatial recurrence is close to diffusion (reasoning as iterative denoising)
Two things I’d most like feedback on:
- Do the analogies (lasagna for the supervision spectrum, grocery shopping for GES) carry their weight, or do any of them mislead?
- The diffusion-as-reasoning framing in the last section is my own and the most speculative part. If it’s naive or wrong, I’d rather hear it than keep repeating it.
Fair warning: the post is from October 2025 and I stopped my literature review around late August 2025, so it predates newer work.
submitted by /u/nicofirst1