[D] Interesting Gradient Norm Goes Down-Up-Down

While pre-training an MoE model with modelscope-swift (Megatron backend), I noticed that the gradient norm does not decrease monotonically: it goes down, then rises, then goes down again. The language modeling loss decreases steadily the whole time, but I'd like to understand why training behaves like this. Is it a problem, and if so, how can I fix it?
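
To make the observation concrete, this is roughly what I'm tracking (a minimal plain-PyTorch sketch of per-step global and per-parameter gradient norms; the actual curves come from swift/megatron's own logging, so `model`, `loss`, and `step` here are just placeholders):

```python
import torch

def grad_norms(model: torch.nn.Module):
    """Return the global L2 gradient norm and a per-parameter breakdown."""
    per_param = {}
    total_sq = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        n = p.grad.detach().norm(2).item()
        per_param[name] = n
        total_sq += n ** 2
    return total_sq ** 0.5, per_param

# Inside the training loop, after loss.backward() and before optimizer.step():
#   global_norm, per_param = grad_norms(model)
#   print(f"step {step}: grad_norm={global_norm:.4f}")
```

A per-parameter breakdown like this also makes it easy to see whether the bump in the global norm is spread across the whole network or concentrated in a few modules.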

Some details:

  • init: normal, std=0.02
  • lr: linear warmup over 2.5k steps to 4e-4, then held constant (sketched after this list); bsz: 4M tokens
  • setting: pre-training from scratch
  • model: a smaller Qwen3-MoE model of 3B-A900M
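
For clarity, the LR schedule above is plain linear warmup followed by a constant rate. A minimal equivalent sketch with `torch.optim.lr_scheduler.LambdaLR` (the tiny `Linear` model is only a placeholder, not the actual 3B-A900M MoE):

```python
import torch

WARMUP_STEPS = 2_500
PEAK_LR = 4e-4

model = torch.nn.Linear(8, 8)  # placeholder; the real run uses the MoE model
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    # Linear warmup from 0 to PEAK_LR over WARMUP_STEPS, then constant at PEAK_LR.
    return min(1.0, (step + 1) / WARMUP_STEPS)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step.
```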

https://preview.redd.it/hg2fed5u2ejg1.png?width=352&format=png&auto=webp&s=b49e0a9c6bd46e0f1f0d0b49f37773dfc271700d

https://preview.redd.it/zesiw2fu2ejg1.png?width=364&format=png&auto=webp&s=0ab4d5391721d0cd97b24f1450f307db63b58689

submitted by /u/Spico197
