[P] MNIST from scratch in Metal (C++)
I built a simple 2-layer MNIST MLP that trains + runs inference from scratch, using only Apple’s metal-cpp library. The goal was to learn GPU programming “for real” and see what actually moves the needle on Apple Silicon: not just writing a highly optimized matmul kernel, but also understanding Metal’s API for buffer residency, command buffer structure, and CPU/GPU synchronization. It was fun (and humbling) to see how much those API-level choices affect performance. Surprisingly, I was able to […]