[P] MNIST from scratch in Metal (C++)
I built a simple 2-layer MNIST MLP that trains and runs inference from scratch, using only Apple’s metal-cpp library.
The goal was to learn GPU programming “for real” and see what actually moves the needle on Apple Silicon — not just writing a highly optimized matmul kernel, but also understanding Metal’s API: buffer residency, command buffer structure, and CPU/GPU synchronization. It was fun (and humbling) to see how much those API-level choices affect performance.
Surprisingly, the final version was able to beat MLX’s training speed at small batch sizes!
Versions:
– MLX baseline
– Pure C CPU baseline
– GPU v1: naive Metal kernels (matmul + ReLU)
– GPU v2: forward + backward kernels + better buffer management + less CPU/GPU sync
– GPU v3: single command buffer per batch (sync only once per epoch for loss)
submitted by /u/memes_for_developers