[D] Is Grokking unique to transformers/attention?
Is grokking unique to the attention mechanism? Everything I've read about it seems to suggest it's a product of attention and the models that utilise it. Is this actually the case, or can a standard MLP also start grokking?
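
For concreteness, here is a minimal sketch of how one could test this directly, with no attention anywhere in the model: a plain MLP trained on modular addition, the task family where grokking was originally reported. This assumes PyTorch; the modulus, layer sizes, weight decay, and step count are illustrative choices, not taken from any particular paper.

```python
# Grokking probe: plain MLP on (a + b) mod p, no attention layers.
# Hyperparameters are illustrative; grokking is sensitive to the
# train fraction, weight decay, and total training length.
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97                      # modulus (hypothetical choice)
frac_train = 0.4            # fraction of all (a, b) pairs used for training

# Full dataset: inputs are one_hot(a) concatenated with one_hot(b).
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
x = torch.cat([nn.functional.one_hot(pairs[:, 0], p),
               nn.functional.one_hot(pairs[:, 1], p)], dim=1).float()
y = (pairs[:, 0] + pairs[:, 1]) % p

perm = torch.randperm(len(pairs))
n_train = int(frac_train * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(      # a standard MLP, nothing transformer-specific
    nn.Linear(2 * p, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, p),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

# Grokking can show up long after train accuracy saturates,
# so train far past the point of perfect train accuracy.
for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(x[train_idx]), y[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(x[train_idx]).argmax(1) == y[train_idx]).float().mean()
            test_acc = (model(x[test_idx]).argmax(1) == y[test_idx]).float().mean()
        print(f"step {step:6d}  train {train_acc:.3f}  test {test_acc:.3f}")
```

The signature to look for is train accuracy reaching 1.0 early while test accuracy sits near chance for a long stretch, then climbs abruptly much later. Whether and when that delayed jump appears depends heavily on the train fraction and the weight decay.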
submitted by /u/Dependent-Shake3906