[D] Is Grokking unique to transformers/attention?

Is grokking unique to the attention mechanism? Every time I've read up on it, the discussion seems to suggest it's a product of attention and of models that utilise it. Is this actually the case, or can a standard MLP also start grokking?
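
One way to answer this empirically is to rerun the classic grokking setup (modular arithmetic, a small training fraction, strong weight decay, very long training) with a plain MLP in place of the transformer. Below is a minimal sketch in PyTorch; the modulus, embedding size, hidden width, learning rate, weight decay, and step count are illustrative assumptions, not values from any particular paper. If delayed generalisation shows up here, it can't be an artefact of attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

P = 97                      # modulus for (a + b) mod P (hypothetical choice)
TRAIN_FRAC = 0.4            # fraction of all (a, b) pairs used for training
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Build the full dataset of operand pairs and their labels.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(TRAIN_FRAC * len(pairs))
train_idx = perm[:n_train].to(DEVICE)
test_idx = perm[n_train:].to(DEVICE)

class MLP(nn.Module):
    """Embed the two operands, concatenate, and pass through a plain MLP —
    no attention anywhere."""
    def __init__(self, p=P, d=128, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.net = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, p),
        )

    def forward(self, x):              # x: (batch, 2) integer operands
        e = self.embed(x).flatten(1)   # (batch, 2 * d)
        return self.net(e)

model = MLP().to(DEVICE)
# Heavy weight decay is the ingredient grokking studies usually emphasise,
# rather than the attention mechanism itself.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

X, y = pairs.to(DEVICE), labels.to(DEVICE)

def accuracy(idx):
    with torch.no_grad():
        return (model(X[idx]).argmax(-1) == y[idx]).float().mean().item()

for step in range(50_000):             # grokking typically needs many steps
    opt.zero_grad()
    loss = F.cross_entropy(model(X[train_idx]), y[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # The signature to watch for: train accuracy saturates early,
        # test accuracy jumps much later.
        print(f"step {step:6d}  train_acc {accuracy(train_idx):.3f}  "
              f"test_acc {accuracy(test_idx):.3f}")
```

The key design choice is keeping the training fraction small and the weight decay large; with the full dataset or no regularisation, the train/test curves tend to rise together and the delayed-generalisation effect is much harder to see.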

submitted by /u/Dependent-Shake3906