Beginner question about interpreting a step change in training metrics
I'm playing around with RL as a learning experience, using GRPO on a really simple task: sorting a sequence of 10 digits. The model is a Qwen3-like Transformer built from scratch, with 6 layers and 256-d embeddings over a vocabulary that only contains those 10 digits.

Looking at charts of the training metrics, I'm puzzled by a step change I see after 4800 steps of training. The reward stays relatively flat over multiple thousands of steps and then suddenly goes up. At the same time the advantages' std goes up as well (trying something new?), entropy goes up (zoomed in on the screenshot), and the grad norm goes down afterwards. How would you interpret that? Would you log some other metric for more insight?

The samples to learn from are generated randomly, and I do not schedule any changes to that mechanism over time. The LR is scheduled to decay smoothly after the initial warmup, so at step 4800 there was certainly no change that I scheduled.

To me it looks like the model accidentally found a little breakthrough by sampling some new path. But given that the model has only 10 actions, I wonder how that could be the case: there shouldn't be any unexplored paths after a few steps, no? I should add, though, that the sequences have 30 steps, so the space of possible outputs is much bigger, i.e. 10**30, and maybe it just took a while to find a local pattern? I'm wondering if I'm stumbling over something mechanical here. Thoughts?

submitted by /u/Glittering-Feed855
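To make the advantage observation concrete, here is a minimal sketch of standard GRPO group normalization (the group size and reward values below are made up for illustration, not taken from the post). It shows why the advantage std can sit near zero for a long time and then jump: as long as every completion in a group earns the same reward, all advantages collapse to zero, and the std only comes alive once some samples in a group succeed while others fail.

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), one scalar reward per sampled completion of the
    same prompt. Returns mean/std-normalized advantages, as in GRPO."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Phase 1: the policy fails the same way on every sample -> identical rewards,
# zero within-group variance, so every advantage is ~0 and these groups
# contribute essentially no learning signal.
flat = np.array([0.1, 0.1, 0.1, 0.1])
print(group_advantages(flat))          # ~[0, 0, 0, 0]

# Phase 2: one sample in the group stumbles onto a better (e.g. correctly
# sorted) output -> within-group variance appears, advantages become nonzero,
# and the advantage std jumps together with the reward.
breakthrough = np.array([0.1, 0.1, 0.1, 1.0])
print(group_advantages(breakthrough))  # ~[-0.58, -0.58, -0.58, 1.73]
```

Under this view, one extra metric worth logging might be the fraction of groups with nonzero reward variance (or the per-group success rate), since that is what gates whether GRPO produces any gradient signal at all.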