[R] Is using rotary embeddings for ViT becoming standard practice, or does everyone still use sinusoidal/learnable embeddings?
I’m going through a few MAE papers from about 2+ years ago which I’m trying to reproduce, and it seems that none of them use rotary embeddings — they all use sinusoidal or learned positional embeddings. I’m not sure if this is a ViT quirk or if adoption just happened later.
The only paper I’ve found that discusses it is the one below, which has only around 100 citations.
[2403.13298] Rotary Position Embedding for Vision Transformer
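For context on what rotary embeddings do differently from sinusoidal/learned ones: instead of adding a position vector to the token embedding, RoPE rotates pairs of query/key dimensions by position-dependent angles, so attention scores depend only on *relative* position. A minimal 1D sketch (the linked paper extends this idea to 2D patch grids; the pairing/frequency choices here are illustrative, not the paper’s exact implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector x (even dim) at integer position pos."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1_i, x2_i) pair by its position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Relative property: <rope(q, m), rope(k, n)> depends only on m - n,
# so shifting both positions by the same offset leaves the score unchanged.
a = rope(q, 5) @ rope(k, 3)
b = rope(q, 7) @ rope(k, 5)
print(np.isclose(a, b))  # True: same relative offset, same attention score
```

This relative-position property is the usual argument for RoPE in ViTs too, e.g. for extrapolating to image resolutions unseen at training time.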
submitted by /u/Affectionate_Use9936