[P] Wrote a VLM from scratch! (ViT-base + Q-Former + LoRA finetuning)
Hey all. Just sharing a project I have been working on for the past two months. This one is about finetuning text-only language models to become vision language models (VLMs).
Code is open source (repo below). Sharing a YouTube tutorial + results too, for those who are interested.
Here's my full roadmap for future ML devs walking this path:
– Used 50k images from the Conceptual Captions dataset
– ViT-base encoder as the vision backbone; this remained frozen
– Trained a BLIP-2 style Q-Former model (rough sketch after this list)
– The Q-Former starts from a DistilBERT model
– Added randomly initialized query tokens
– Added extra cross-attention layers so the queries can attend to the ViT tokens
– Trained with the unimodal image-text contrastive (ITC) loss, as in CLIP (loss sketch after the list)
– Also experimented with the multimodal losses from BLIP-2 (ITM and ITG)
– For the LM finetuning (sketch after this list):
– Used the smallest LM I could find: SmolLM-135M-Instruct
– Built a synthetic instruction dataset from the Conceptual Captions image/caption pairs
– Introduced an MLP layer to project from the Q-Former space into the LM embedding space
– Trained LoRA weights for parameter-efficient finetuning
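For anyone curious what the Q-Former part looks like, here's a rough PyTorch sketch of the idea. It's not the repo's exact code (which starts from DistilBERT and inserts extra cross-attention layers); the dims, layer count, and number of query tokens here are placeholders:

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One layer: self-attention over the queries, then cross-attention to ViT patch tokens."""
    def __init__(self, dim=768, n_heads=12, vit_dim=768):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, kdim=vit_dim, vdim=vit_dim, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, vit_tokens):
        q = self.norm1(queries)
        queries = queries + self.self_attn(q, q, q)[0]                      # queries attend to each other
        q = self.norm2(queries)
        queries = queries + self.cross_attn(q, vit_tokens, vit_tokens)[0]   # queries attend to frozen ViT tokens
        return queries + self.ffn(self.norm3(queries))

class QFormer(nn.Module):
    """Compresses a variable number of ViT patch tokens into a fixed set of query tokens."""
    def __init__(self, n_queries=32, dim=768, n_layers=4):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)  # randomly initialized queries
        self.blocks = nn.ModuleList(QFormerBlock(dim) for _ in range(n_layers))

    def forward(self, vit_tokens):                          # vit_tokens: (B, num_patches, vit_dim), ViT stays frozen
        q = self.query_tokens.expand(vit_tokens.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, vit_tokens)
        return q                                            # (B, n_queries, dim)
```

The frozen ViT only supplies `vit_tokens`; gradients flow into the query tokens and Q-Former layers.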
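And a minimal sketch of the CLIP-style ITC loss used in stage 1, assuming you pool the Q-Former queries and the caption tokens into single vectors (the temperature and pooling are placeholders; BLIP-2's version takes a max over the query tokens instead of pooling):

```python
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image/caption pairs in the batch are positives, everything else is a negative."""
    image_emb = F.normalize(image_emb, dim=-1)        # (B, D) pooled Q-Former output
    text_emb = F.normalize(text_emb, dim=-1)          # (B, D) pooled caption embedding
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```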
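The LM-finetuning stage, sketched with Hugging Face transformers + peft. The LoRA rank/targets, projector sizes, and the exact way the visual prefix is spliced in are assumptions for illustration, not necessarily what the repo does:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
lm_hidden = lm.config.hidden_size

# LoRA adapters on the attention projections (rank/targets are illustrative)
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
lm = get_peft_model(lm, lora_cfg)

# MLP projector from the Q-Former hidden size (assumed 768) into the LM embedding space
projector = nn.Sequential(
    nn.Linear(768, lm_hidden),
    nn.GELU(),
    nn.Linear(lm_hidden, lm_hidden),
)

def forward_step(query_tokens, input_ids, labels):
    """query_tokens: (B, n_queries, 768) from the frozen ViT + trained Q-Former."""
    visual_embeds = projector(query_tokens)                    # (B, n_queries, H)
    text_embeds = lm.get_input_embeddings()(input_ids)         # (B, T, H)
    inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
    # Mask the visual prefix out of the language-modeling loss
    prefix_labels = torch.full(visual_embeds.shape[:2], -100,
                               dtype=labels.dtype, device=labels.device)
    out = lm(inputs_embeds=inputs_embeds, labels=torch.cat([prefix_labels, labels], dim=1))
    return out.loss   # only the LoRA weights + projector get trained; the base LM stays frozen
```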
Results were pretty cool. Training both the Q-Former and the LM took about 4 hours on one V100 and cost me around 50 cents, which was amazing given the quality of the results.
Git repo: https://github.com/avbiswas/vlm
Youtube: https://youtu.be/Oj27kALfvr0
submitted by /u/AvvYaa