[P] MNIST from scratch in Metal (C++)
I built a simple 2-layer MNIST MLP that trains and runs inference from scratch, using only Apple’s metal-cpp library.
The goal was to learn GPU programming “for real” and see what actually moves the needle on Apple Silicon — not just writing a highly optimized matmul kernel, but also understanding Metal’s API: buffer residency, command buffer structure, and CPU/GPU synchronization. It was fun (and humbling) to see how much those API-level choices affect performance.
Surprisingly, the final version was able to beat MLX’s training speed at small batch sizes!
Versions:
– MLX baseline
– Pure C CPU baseline
– GPU v1: naive Metal kernels (matmul + ReLU)
– GPU v2: forward + backward kernels + better buffer management + less CPU/GPU sync
– GPU v3: single command buffer per batch (sync only once per epoch for loss)
submitted by /u/memes_for_developers