[P] My shot at a DeepSeek-style MoE on a single RTX 5090

I know most will wonder why I’m wasting my time training at only ~19k tok/s. It’s because I can. I’m doing this in my living room in my spare time, with zero formal ML experience. The absurd amount I’ve learned in the last few months made me realize I really picked the wrong career.

My Mixture of Experts is a 2.36B-parameter model with 8 routed experts plus a shared expert, using top-2 routing. Attention is Grouped Query Attention with QK-normalization and RoPE positional embeddings. All feed-forward layers use SwiGLU activation, with RMSNorm throughout. Load balancing follows DeepSeek V3’s auxiliary-loss-free approach using bias-based routing. I monitor coefficient of variation and maximum violation per step.
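Roughly, the bias-based top-2 routing works like the sketch below (simplified and illustrative, not my exact code): the per-expert bias only influences which experts get picked, while the mixing weights come from the unbiased scores, and the shared expert just runs on every token and gets added to the routed mix.

```python
import torch
import torch.nn as nn

class AuxFreeTop2Router(nn.Module):
    """Top-2 router over 8 routed experts with a per-expert selection bias."""
    def __init__(self, d_model: int, n_routed: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        # Per-expert bias, adjusted outside the gradient path each step.
        self.expert_bias = nn.Parameter(torch.zeros(n_routed), requires_grad=False)

    def forward(self, x: torch.Tensor):
        # x: [tokens, d_model]
        scores = torch.sigmoid(self.gate(x))                  # expert affinities
        # The bias only affects WHICH experts are selected...
        _, topk_idx = torch.topk(scores + self.expert_bias, k=2, dim=-1)
        # ...while the mixing weights come from the unbiased scores.
        topk_scores = scores.gather(-1, topk_idx)
        gate_w = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
        return topk_idx, gate_w, scores
```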

Training runs on TorchAO FP8 quantization with the Muon optimizer and a multi-stage learning rate schedule (warmup, constant, cosine decay). The backend is optimized for Blackwell architecture with cuBLASLt.
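The LR schedule is nothing exotic; it looks something like this sketch (step counts here are placeholders, not my real values):

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def multistage_lr(warmup_steps: int, constant_steps: int, total_steps: int):
    """Multiplier for LambdaLR: linear warmup -> constant -> cosine decay."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                          # linear warmup
            return step / max(1, warmup_steps)
        if step < warmup_steps + constant_steps:         # constant plateau
            return 1.0
        # cosine decay over the remaining steps
        progress = (step - warmup_steps - constant_steps) / max(
            1, total_steps - warmup_steps - constant_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return lr_lambda

# usage: scheduler = LambdaLR(optimizer, lr_lambda=multistage_lr(2000, 20000, 60000))
```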

The data pipeline implements MeCo (Metadata Conditioning then Cooldown) with ledger-based deterministic sampling. I have document-aware attention masking and cross-document loss masking, but both were disabled for the initial MeCo run. I have since disabled MeCo and curated a clean corpus with no tagging of any kind. MeCo worked, but it worked too well, and with only 8 experts it became very problematic.
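For reference, document-aware attention masking on packed sequences is basically a block-diagonal causal mask; a rough sketch (assuming documents are separated by an EOS token, which may not match my actual packing):

```python
import torch

def document_causal_mask(token_ids: torch.Tensor, eos_id: int) -> torch.Tensor:
    """[seq_len] token ids -> [seq_len, seq_len] bool mask, True = may attend."""
    seq_len = token_ids.shape[0]
    # Document index per position: increments right after each EOS token.
    ends = (token_ids == eos_id).long()
    doc_id = torch.cat([torch.zeros(1, dtype=torch.long), torch.cumsum(ends, 0)[:-1]])
    same_doc = doc_id.unsqueeze(0) == doc_id.unsqueeze(1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return same_doc & causal
```

Cross-document loss masking is then just setting the label to the ignore index (-100) wherever the target token belongs to the next document.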

My two biggest early mistakes were not using symmetric router initialization (std=0.006) and not having a dense first layer. That cost me a lot of time and sleep. So what did I do? I cheated: I used an aux loss of 0.003 and EMA smoothing at the beginning. I just didn’t know better, and I paid a price for it later on.
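For anyone else starting out, the router init fix itself is tiny; something like this (placeholder dims, illustrative rather than my exact config; the point is that a small symmetric init means every expert starts with near-uniform routing probabilities):

```python
import torch.nn as nn

d_model, n_routed = 2048, 8            # placeholder dims
gate = nn.Linear(d_model, n_routed, bias=False)
# Small symmetric init so all experts start near-uniform.
nn.init.normal_(gate.weight, mean=0.0, std=0.006)
```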

DO NOT use router scaling on a small MoE. DeepSeek used 2.5. Kimi K2 used 2.446. I tried just 1.2 and it was horribly unstable; maximum violation blew up to over 0.500.
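For context, that scaling factor multiplies the routed-expert mixture before it’s added to the shared expert’s output, roughly like this (shapes and names are just for illustration):

```python
import torch

def combine_experts(shared_out: torch.Tensor,    # [tokens, d_model]
                    expert_outs: torch.Tensor,   # [tokens, k, d_model]
                    gate_w: torch.Tensor,        # [tokens, k]
                    routed_scaling_factor: float = 1.0) -> torch.Tensor:
    routed = (gate_w.unsqueeze(-1) * expert_outs).sum(dim=1)
    # Scaling > 1 amplifies the routed contribution relative to the shared expert.
    return shared_out + routed_scaling_factor * routed
```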

Batch size 24, grad accumulation 6, LR 3e-4, AdamW + Muon (scaled). Bias 0.001, aux 0.0001. I update every step.
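The per-step bias update itself is just the sign rule from DeepSeek V3, roughly like this sketch (assuming the 0.001 "Bias" value above is the per-step update rate):

```python
import torch

@torch.no_grad()
def update_expert_bias(expert_bias: torch.Tensor,    # [n_experts], selection bias
                       token_counts: torch.Tensor,   # tokens routed to each expert this step
                       update_rate: float = 1e-3) -> None:
    target = token_counts.float().mean()
    # Underloaded experts get nudged up, overloaded ones down.
    expert_bias += update_rate * torch.sign(target - token_counts.float())
```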

As of yesterday:

2026-01-13 20:53:06 step 41915 | lr 3.00e-04 | loss 1.8867 | gnorm 0.13 | 19,415 tok/s (ema 19,553) | 75.9s/5 steps | cv 0.022 | bias -0.001708±0.179996 | rel_max=0.036 maxvio=0.027 ent=1.203 applied=True | seq_aux 2.444
2026-01-13 20:54:20 [moe] token counts: [150018, 148422, 155402, 147966, 145236, 146724, 144358, 141522]
2026-01-13 20:54:20 step 41920 | lr 3.00e-04 | loss 1.9263 | gnorm 0.13 | 20,102 tok/s (ema 19,828) | 73.4s/5 steps | cv 0.026 | bias -0.001708±0.179920 | rel_max=0.054 maxvio=0.054 ent=1.211 applied=True | seq_aux 2.515

I’ve got a long way to go 🙂

I’ll gladly answer any questions. No gatekeeping here.

submitted by /u/exhorder72
