Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]

For some time now, the research community has been proposing seemingly more efficient and effective tokenization schemes for vision. Is there any hint that non-fixed-patch tokenization is being used in the big players' models?

I imagine not, and I’m trying to think why:

– marginal gains?

– pipelines needing a fixed number of tokens per image upfront for efficiency reasons (or because of even stricter constraints)?

– scaling laws are not well understood for input-adaptive patching, so big players don't bet on it?

or am I simply totally wrong and under the hood all the big players are doing dynamic tokenization for vision?
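
To make the fixed-token-count point concrete, here's a toy sketch. It is not how any production VLM is known to work. `fixed_patch_tokens` is the usual grid count for a fixed-patch ViT (ignoring any class token); `quadtree_tokens` is a hypothetical content-adaptive scheme made up purely for illustration, which keeps splitting a square region while its pixel range exceeds a threshold.

```python
def fixed_patch_tokens(height, width, patch=16):
    """Token count for a fixed-patch grid: depends only on the resolution."""
    return (height // patch) * (width // patch)


def quadtree_tokens(img, y, x, size, min_size, threshold):
    """Toy adaptive count (hypothetical): split a region while it is 'busy'."""
    pixels = [img[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    if size <= min_size or max(pixels) - min(pixels) <= threshold:
        return 1  # the whole region becomes a single token
    half = size // 2
    return sum(
        quadtree_tokens(img, y + dy, x + dx, half, min_size, threshold)
        for dy in (0, half)
        for dx in (0, half)
    )


# Two 64x64 grayscale images: one uniform, one pixel-level checkerboard.
flat = [[0] * 64 for _ in range(64)]
busy = [[(r + c) % 2 * 255 for c in range(64)] for r in range(64)]

# Fixed patching: the count is known before looking at the pixels.
print(fixed_patch_tokens(64, 64, patch=8))       # 64 for either image

# Adaptive patching: the count depends on the content.
print(quadtree_tokens(flat, 0, 0, 64, 8, 10))    # 1
print(quadtree_tokens(busy, 0, 0, 64, 8, 10))    # 64
```

With a fixed grid, batch shapes and compute per image can be planned from the resolution alone; with something like the toy quadtree, sequence length is only known after inspecting each image, which is the batching and budgeting concern raised in the second bullet.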

submitted by /u/howtorewriteaname