[D] Rules for High-Performance Embedding model training?

Hi, I’m thinking about renting a B200 at spot prices and fine-tuning Qwen3-Embedding for my native language (Polish). Right now I’m in the process of gathering data, but in the meantime I started thinking about how to actually utilize the B200 with such a small model. My reasoning is that a B200 is cheaper than running a 5090 for ~5x the time, and the B200 allows a much higher batch size.

My assumptions:

1. Use full fine-tuning (maybe later I would check LoRA, but that would require an even better pipeline).
2. Use Unsloth FastSentenceTransformer (I assume it has sequence packing, but it’s hard to tell whether that is implemented for embedding models).
3. I want a batch size of ~512, so gradient checkpointing would be useful.
4. bfloat16 training.
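For context on points 1, 3 and 4, here is a minimal sketch of how that setup could look with the plain Sentence Transformers v3 trainer (not Unsloth). The model ID, dataset file, column layout and loss choice are my assumptions, so treat it as a starting point rather than a tested recipe:

```python
import torch
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Hypothetical Polish (anchor, positive) pairs file -- replace with your own data.
train_dataset = load_dataset("csv", data_files="pl_pairs.csv")["train"]

# Qwen3-Embedding base model (0.6B assumed here), loaded in bf16.
model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"torch_dtype": torch.bfloat16},
)

# In-batch negatives: every other example in the batch acts as a negative,
# so the large batch the B200 allows directly benefits this loss.
loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="qwen3-emb-pl",
    per_device_train_batch_size=512,   # point 3
    gradient_checkpointing=True,       # trade compute for activation memory
    bf16=True,                         # point 4
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```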

Do you have any suggestions on how to prepare the pipeline so I reach ~80% B200 GPU utilization? My ideas are:

1. Pretokenization (will padding tokens be removed by Unsloth to run sequence packing?)
2. Maybe FP8 to speed up training?
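On the pretokenization question: one cheap step that helps regardless of whether the trainer packs sequences is profiling token lengths up front, so you know how much compute padding would waste and what max_seq_length to pick. A rough sketch follows; the tokenizer ID, file name and "text" column are assumptions:

```python
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
ds = load_dataset("csv", data_files="pl_pairs.csv")["train"]

# Tokenize once and keep only the lengths -- no padding, no truncation.
lengths = np.array([
    len(tokenizer(text, truncation=False)["input_ids"])
    for text in ds["text"]
])

print("p50 / p95 / max tokens:",
      np.percentile(lengths, 50),
      np.percentile(lengths, 95),
      lengths.max())

# Fraction of tokens that would be padding at a fixed max_seq_length of 512.
max_len = 512
clipped = np.minimum(lengths, max_len)
pad_waste = 1 - clipped.sum() / (len(lengths) * max_len)
print(f"padding waste at seq_len={max_len}: {pad_waste:.1%}")
```

If the padding waste turns out to be large (short texts against a long max_seq_length), packing or at least length-grouped batching is probably where most of the utilization gain will come from; if it's small, plain padded bf16 batches may already keep the B200 busy and FP8 becomes a secondary optimization.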

submitted by /u/melgor89