[R] Is using rotary embeddings for ViT becoming standard practice, or does everyone still use sinusoidal/learnable embeddings?

I’m going through a few MAE papers from about 2+ years ago that I’m trying to reproduce, and it seems that none of them use rotary embeddings. They all use sinusoidal or learned ones. I’m not sure if this is a ViT quirk or if adoption just happened later.

The only paper I’ve found that discusses it is this one, which only has around 100 citations:

[2403.13298] Rotary Position Embedding for Vision Transformer
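For anyone unfamiliar with what rotary embeddings actually do, here is a minimal 1-D sketch in NumPy. It is illustrative only: the paper above extends this idea to 2-D frequencies over ViT patch grids, and the pairing convention (first half / second half split) is just one common choice, not necessarily the paper's.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate channel pairs of x (seq_len, dim) by position-dependent angles.

    Each pair (x1[c], x2[c]) is rotated by angle pos * freq[c], so the
    rotation encodes absolute position, while dot products between rotated
    queries and keys depend only on relative position.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0
    half = dim // 2
    # One frequency per channel pair, geometrically spaced as in the original RoPE.
    freqs = base ** (-np.arange(half) / half)            # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2-D rotation applied pair-wise; preserves vector norms.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(16, 8)   # 16 tokens, one 8-dim attention head
k = np.random.randn(16, 8)
q_rot, k_rot = rope(q), rope(k)
```

Because each channel pair undergoes a pure rotation, `q_rot[i] @ k_rot[j]` depends on `i - j` rather than on `i` and `j` separately, which is the relative-position property that makes RoPE attractive for attention.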

submitted by /u/Affectionate_Use9936