[D] Interesting Gradient Norm Goes Down-Up-Down
When I'm training an MoE model with modelscope-swift (with Megatron as the backend), I find that the gradient norm goes down, then up, then down again during training. Although the language-modeling loss decreases steadily, I want to figure out why training behaves this way. Is this a problem, and how can I resolve it? Some details:
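For reference, here is a minimal sketch in plain PyTorch (not swift/Megatron-specific; the toy model and loss are stand-ins) of what the logged gradient norm usually measures, the global L2 norm over all parameter gradients, and of global-norm clipping, a common way to cap spikes without stopping training:

```python
import torch
import torch.nn as nn

# Stand-ins for the actual MoE model and LM loss.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 16)
loss = model(x).pow(2).mean()
loss.backward()

# The curve typically plotted as "grad norm": the global L2 norm over
# all parameter gradients, taken after backward() and before step().
grad_norm = torch.norm(
    torch.stack([
        p.grad.detach().norm(2)
        for p in model.parameters()
        if p.grad is not None
    ]),
    2,
)
print(f"grad norm: {grad_norm.item():.4f}")

# Common mitigation if spikes destabilize training: clip the global norm.
# clip_grad_norm_ returns the pre-clipping norm, so it also works for logging.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
optimizer.zero_grad()
```

If the loss keeps going down while the norm fluctuates, clipping at a reasonable max_norm usually just trims the spikes rather than changing the trend.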
submitted by /u/Spico197