Practicing science communication on RL-for-reasoning: where does my explanation get the RL wrong?

Some background so you know where I’m coming from: I’m an AI researcher and RL/LLM reasoning was my PhD area. A while back I was asked to give a talk on how RL is used to induce reasoning in LLMs, and afterwards I tried to turn the dense version into a written explainer for a general but technical audience.

I’m trying to get better at science communication, so I’m posting here for the thing this sub is good at, which is telling me where I got the RL wrong or where an analogy smooths over something it shouldn’t.

Link: https://nicolobrandizzi.com/blog/rl-reasoning-llm/

What the post covers:

  • RL 101 (state, action, reward) and how it differs from supervised learning
  • GES (generate, evaluate, select) as a frame for reasoning
  • process vs outcome supervision
  • PPO and GRPO, with the advantage / baseline / value function / GAE progression
  • the spurious-rewards result (random rewards still improving Qwen but hurting LLaMA, and what that implies about GRPO surfacing existing ability rather than teaching new reasoning)
  • a more speculative closing section where I argue reasoning might be framed as recurrence, and that spatial recurrence is close to diffusion (reasoning as iterative denoising)

Two things I’d most like feedback on:

  1. Do the analogies (lasagna for the supervision spectrum, grocery shopping for GES) carry their weight, or do any of them mislead?
  2. The diffusion-as-reasoning framing in the last section is my own and the most speculative part. If it’s naive or wrong, I’d rather hear it than keep repeating it.

Fair warning: the post is from October 2025 and I stopped my literature review around late August 2025, so it predates newer work.

submitted by /u/nicofirst1