[P] Wrote a VLM from scratch! (ViT-base + Q-Former + LoRA finetuning)

Hey all. Just sharing a project I have been working on for the past two months. This one is about finetuning text-only language models to become vision language models (VLMs).

Code is open source (repo below). Sharing a YouTube tutorial + results too, for those who are interested.

Here's my full roadmap for future ML devs walking this path:

– Used 50k image/caption pairs from the Conceptual Captions dataset
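Not the repo's actual data pipeline, just a minimal sketch of grabbing a 50k subset, assuming the Hugging Face `conceptual_captions` dataset (which stores image URLs plus captions; the images themselves have to be downloaded from those URLs):

```python
# Hypothetical data-prep sketch: sample ~50k image/caption pairs.
# Assumes the Hugging Face "conceptual_captions" dataset and its
# "caption" / "image_url" columns.
from datasets import load_dataset

ds = load_dataset("conceptual_captions", split="train")
subset = ds.shuffle(seed=42).select(range(50_000))  # 50k pairs, as in the post

for row in subset.select(range(3)):
    print(row["caption"], "->", row["image_url"])
```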

– ViT-base encoder as the vision backbone; this remained frozen
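For context, loading and freezing a ViT-base backbone with Hugging Face transformers looks roughly like this (the post only says "ViT-base", so the exact checkpoint is an assumption):

```python
# Minimal sketch of the frozen ViT-base backbone. The checkpoint name
# ("google/vit-base-patch16-224") is an assumption, not confirmed by the repo.
import torch
from transformers import ViTModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

# Freeze every ViT parameter: only the Q-Former (and later the LoRA adapters) train.
for p in vit.parameters():
    p.requires_grad = False
vit.eval()

# pixel_values: (batch, 3, 224, 224) -> patch tokens: (batch, 197, 768)
with torch.no_grad():
    patch_tokens = vit(pixel_values=torch.randn(1, 3, 224, 224)).last_hidden_state
```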

– Trained a BLIP-2 style Q-Former (rough sketch after this block)
– The Q-Former starts from a DistilBERT model
– Added randomly initialized query tokens
– Added extra cross-attention layers to attend to the ViT tokens
– Trained with a unimodal ITC loss (CLIP-style contrastive)
– Also experimented with the multimodal losses from BLIP-2 (ITM and ITG)
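Here's a heavily simplified, hypothetical sketch of that setup (not the repo's code): DistilBERT handles the text side, learnable query tokens attend to the frozen ViT tokens through added cross-attention layers, and a CLIP-style ITC loss pulls matching image/text pairs together. In real BLIP-2 the queries also share self-attention with the text and the similarity is taken as a max over queries; both are simplified away here.

```python
# Simplified BLIP-2-style Q-Former sketch. num_query_tokens, proj_dim, number of
# cross-attention layers, and the mean-pooling of queries are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import DistilBertModel

class QFormer(nn.Module):
    def __init__(self, num_query_tokens=32, proj_dim=256):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        hidden = self.bert.config.dim  # 768 for distilbert-base

        # Randomly initialized learnable query tokens.
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, hidden) * 0.02)

        # Extra cross-attention layers: queries attend to the frozen ViT patch tokens.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(hidden, num_heads=8, batch_first=True) for _ in range(2)
        )

        # Projection heads into the shared contrastive (ITC) space.
        self.img_proj = nn.Linear(hidden, proj_dim)
        self.txt_proj = nn.Linear(hidden, proj_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), CLIP-style init

    def encode_image(self, vit_tokens):
        # vit_tokens: (B, 197, 768) from the frozen ViT
        q = self.query_tokens.expand(vit_tokens.size(0), -1, -1)
        for layer in self.cross_attn:
            q = q + layer(q, vit_tokens, vit_tokens)[0]
        return q  # (B, num_query_tokens, 768)

    def encode_text(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # [CLS]-position token as the text embedding

    def itc_loss(self, vit_tokens, input_ids, attention_mask):
        # CLIP-style symmetric contrastive loss between pooled queries and text embeddings.
        img = F.normalize(self.img_proj(self.encode_image(vit_tokens).mean(dim=1)), dim=-1)
        txt = F.normalize(self.txt_proj(self.encode_text(input_ids, attention_mask)), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        labels = torch.arange(img.size(0), device=img.device)
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```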

– For LM finetuning (sketch after this block)
– Used the smallest LM I could find: SmolLM-135M-Instruct
– Augmented a synthetic instruction dataset from the Conceptual Captions image/caption pairs
– Introduced an MLP layer to adapt from Q-Former space to LM space
– Used LoRA weights for parameter-efficient finetuning
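And a hedged sketch of the LM side, assuming SmolLM-135M-Instruct from the HuggingFaceTB org and the peft library for LoRA; the adapter sizes, target modules, and LoRA hyperparameters are illustrative, not the repo's actual config:

```python
# Sketch: MLP adapter maps Q-Former outputs into the LM embedding space, projected
# query tokens are prepended to the text embeddings, and LoRA keeps finetuning cheap.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

# MLP adapter: Q-Former hidden size (768 assumed here) -> LM hidden size.
lm_dim = lm.config.hidden_size
adapter = nn.Sequential(
    nn.Linear(768, lm_dim),
    nn.GELU(),
    nn.Linear(lm_dim, lm_dim),
)

# LoRA on the attention projections; the base LM weights stay frozen.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
lm = get_peft_model(lm, lora_cfg)

def forward_with_image(query_tokens, input_ids, labels):
    # query_tokens: (B, num_queries, 768) from the Q-Former
    img_embeds = adapter(query_tokens)                          # (B, Q, lm_dim)
    txt_embeds = lm.get_input_embeddings()(input_ids)           # (B, T, lm_dim)
    inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)  # (B, Q+T, lm_dim)
    # Don't compute LM loss on the image-token positions.
    ignore = torch.full(img_embeds.shape[:2], -100,
                        dtype=labels.dtype, device=labels.device)
    return lm(inputs_embeds=inputs_embeds,
              labels=torch.cat([ignore, labels], dim=1)).loss

# Tiny usage example with random query tokens and caption-style supervision.
batch = tok(["Describe the image."], return_tensors="pt")
loss = forward_with_image(torch.randn(1, 32, 768),
                          batch["input_ids"], batch["input_ids"].clone())
```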

Results were pretty cool. Training both the Q-Former and the LM took about 4 hours on a single V100 and cost me around 50 cents, which was amazing given the quality of the results.

Git repo: https://github.com/avbiswas/vlm

Youtube: https://youtu.be/Oj27kALfvr0

submitted by /u/AvvYaa
