[P] Bypassing CoreML to natively train a 110M Transformer on the Apple Neural Engine (Orion)
It is hard to communicate how frustrating the current Apple ML stack is for low-level research. CoreML imposes opaque abstractions that prevent direct ANE programming and do not support on-device training. Despite having up to 38 TOPS (INT8) and ~19 TFLOPS of fp16 compute, the ANE remains almost entirely unused for large language model workloads.
Building on the foundational hardware reverse-engineering by maderix (who mapped the private API surface and benchmarked the 32 MB SRAM cliff), I wanted to see if we could bridge the gap from a raw hardware exploit to a mathematically stable runtime.
I recently open-sourced ORION, to my knowledge the first open end-to-end system that combines direct ANE execution, a custom compiler pipeline, and stable multi-step training.
Just to be transparent about the methodology: I approached this entire build as an exercise in what I’ll call architectural delegation. My day job is Enterprise Program Management, not writing low-level C kernels. I used Claude to rapidly generate the Objective-C syntax while I acted as the system state manager—designing the compiler passes and forcing a probabilistic model to map deterministic hardware boundaries across 140 engineering tasks spanning 14 sessions.
When you map it out, the ANE presents a massive wall of undocumented silicon behavior. We cataloged 17 total programming constraints, 11 of which were newly discovered during ORION’s development. A few of the critical ones:
• The concat operation causes an immediate compilation failure.
• There is a minimum IOSurface size of approximately 49 KB for evaluation.
• BLOBFILE weights require an undocumented offset of 64 bytes from the chunk header, which causes silent weight corruption if incorrect.
• The compiler limits each process to ~119 compilations before silently failing.
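To make the list above concrete, here is a minimal sketch of a pre-flight check that rejects a program before it ever reaches the ANE compiler. This is not ORION's code: the `Program` type, the `preflight` function, and the exact byte thresholds are hypothetical, and the numbers are just the approximate values quoted above.

```python
from dataclasses import dataclass

# Values taken from the list above; the exact numbers are assumptions.
MIN_IOSURFACE_BYTES = 49 * 1024   # "approximately 49 KB" (exact threshold not stated)
BLOBFILE_DATA_OFFSET = 64         # bytes from the chunk header to the weight data
MAX_COMPILES_PER_PROCESS = 119    # "~119" compilations before silent failure
FORBIDDEN_OPS = {"concat"}        # fails at compile time


@dataclass
class Program:
    ops: list            # op names in the graph, e.g. ["matmul", "add"]
    surface_bytes: int   # size of the I/O buffer the program will use
    weight_offset: int   # offset of the weight data relative to its chunk header


def preflight(program, compiles_so_far):
    """Return a list of human-readable violations (empty list means OK)."""
    problems = []
    for op in program.ops:
        if op in FORBIDDEN_OPS:
            problems.append(f"op '{op}' is rejected by the ANE compiler")
    if program.surface_bytes < MIN_IOSURFACE_BYTES:
        problems.append(
            f"IOSurface too small: {program.surface_bytes} < {MIN_IOSURFACE_BYTES}"
        )
    if program.weight_offset != BLOBFILE_DATA_OFFSET:
        problems.append(
            f"weight offset {program.weight_offset} != {BLOBFILE_DATA_OFFSET}"
        )
    if compiles_so_far >= MAX_COMPILES_PER_PROCESS:
        problems.append("compile budget exhausted for this process")
    return problems
```

The point of checking up front is that two of these failures (wrong weight offset, exhausted compile budget) are silent on the hardware side, so a loud error before compilation is much cheaper to debug.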
To handle this, ORION uses a custom compiler that lowers a 27-operation graph IR through five optimization passes (including Dead Code Elimination, Cast Fusion, and SRAM annotation against the 32 MB budget) to emit ANE-native MIL.
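For anyone unfamiliar with what a pass like Dead Code Elimination does, here is a toy version on a made-up graph IR. The `Node` type and `eliminate_dead_code` function are illustrative assumptions, not ORION's actual IR or pass implementation; the idea is simply to keep only the nodes reachable from the graph outputs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)  # names of producer nodes


def eliminate_dead_code(nodes, outputs):
    """Keep only nodes reachable (via inputs) from the requested outputs."""
    by_name = {n.name: n for n in nodes}
    live = set()
    stack = list(outputs)
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        stack.extend(by_name[name].inputs)
    # Preserve the original (topological) order of the surviving nodes.
    return [n for n in nodes if n.name in live]


graph = [
    Node("x", "input"),
    Node("w", "const"),
    Node("y", "matmul", ["x", "w"]),
    Node("unused", "relu", ["x"]),   # never feeds an output
    Node("out", "softmax", ["y"]),
]
pruned = eliminate_dead_code(graph, outputs=["out"])
print([n.name for n in pruned])  # ['x', 'w', 'y', 'out']
```

On a target with a hard per-process compile limit and a fixed SRAM budget, dropping unreachable nodes before lowering is cheap insurance against wasting either.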
The hardest part was what I’ll call the numerical stability ceiling. Previous attempts at ANE training (like ANEgpt) suffered from 100% NaN divergence after the first training step. We solved this by isolating three interacting bugs:
• Stale Programs on Resume: ANE programs were compiling before checkpoint weights loaded. We fixed this via a deferred compilation pipeline.
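The stale-program fix can be sketched as lazy compilation: nothing is compiled until weights are present, and loading new weights invalidates whatever was compiled before. This is an assumption about the general shape of a deferred pipeline, not ORION's implementation; `DeferredProgram` and `fake_compile` are hypothetical names, and `fake_compile` only imitates weights being baked in at compile time.

```python
class DeferredProgram:
    """Compile on first use, and only after weights have been loaded."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._weights = None
        self._compiled = None

    def load_weights(self, weights):
        self._weights = weights
        self._compiled = None  # any earlier program is now stale

    def run(self, x):
        if self._weights is None:
            raise RuntimeError("weights must be loaded before the first run")
        if self._compiled is None:
            self._compiled = self._compile_fn(self._weights)
        return self._compiled(x)


def fake_compile(weights):
    # Stand-in for a real compiler: snapshot the weights at compile time.
    baked = dict(weights)
    return lambda x: x * baked["scale"]


prog = DeferredProgram(fake_compile)
prog.load_weights({"scale": 2})    # e.g. weights restored from a checkpoint
print(prog.run(3))                 # 6
prog.load_weights({"scale": 5})    # a later update forces a recompile
print(prog.run(3))                 # 15
```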
The leverage here is real. On an M4 Max, the system hits 170+ tokens/s for GPT-2 124M inference in decode mode. For training, we demonstrated stable multi-step training of a 110M-parameter transformer on TinyStories. Over 1,000 steps, the loss dropped from 12.29 to 6.19 with zero NaN occurrences. To bypass the 119-compilation limit, the runtime uses an exec() restart strategy, passing checkpoint state through the filesystem.
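The restart strategy is easier to picture in code. Below is a minimal Python sketch of the pattern (the real runtime is Objective-C): count compilations, write a checkpoint to disk before the budget runs out, then replace the process image and resume from the file. The function names, the JSON checkpoint format, and the safety margin are all hypothetical; a real checkpoint would carry weights and optimizer state, not just a step counter.

```python
import json
import os
import sys

COMPILE_BUDGET = 119  # approximate per-process limit quoted in the post


def save_checkpoint(path, step, loss):
    """Write state to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "loss": loss}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def needs_restart(compiles_in_process, compiles_per_step=1, margin=1):
    """True if the next step would push this process past the budget (minus a margin)."""
    return compiles_in_process + compiles_per_step > COMPILE_BUDGET - margin


def restart_process():
    """Replace the current process with a fresh copy of itself (does not return)."""
    os.execv(sys.executable, [sys.executable] + sys.argv)
```

In a training loop this would look roughly like: `load_checkpoint` at startup, and after each step, if `needs_restart(...)` is true, call `save_checkpoint(...)` followed by `restart_process()`. Because exec replaces the process image, the compiler's per-process counter starts over while the training state survives on disk.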
There are real caveats here. Because the ANE bakes weights at compile time, every single weight update requires recompilation. In our loop, compilation consumes ~4.2 s per step, while the actual compute takes ~908 ms (achieving 0.612 TFLOPS).
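To put those numbers in perspective, here is the back-of-the-envelope arithmetic as a short script. It uses only figures quoted in this post and assumes the 0.612 TFLOPS figure is measured over the compute portion alone and that ~19 TFLOPS is the relevant fp16 peak; both are my reading, not additional measurements.

```python
# Figures quoted in the post (all approximate).
compile_s = 4.2           # recompilation time per training step
compute_s = 0.908         # actual ANE compute time per training step
achieved_tflops = 0.612   # throughput during the compute portion
peak_fp16_tflops = 19.0   # nominal fp16 peak quoted earlier


def step_breakdown(compile_s, compute_s):
    """Return (total step time in seconds, fraction of it spent compiling)."""
    total = compile_s + compute_s
    return total, compile_s / total


total_s, compile_share = step_breakdown(compile_s, compute_s)
utilization = achieved_tflops / peak_fp16_tflops
speedup_if_no_compile = total_s / compute_s

print(f"step time:            {total_s:.3f} s")
print(f"spent compiling:      {compile_share:.1%}")
print(f"fp16 peak utilization:{utilization:.1%}")
print(f"upper-bound speedup if compilation were free: {speedup_if_no_compile:.1f}x")
```

So roughly four-fifths of every step goes to the compiler, which is why weight patching or incremental compilation matters more right now than kernel-level tuning.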
But imo, local AI is nowhere near a “steady state” yet; this is a layer change. Proving that we can execute mathematically stable, multi-step gradient descent directly on Apple’s locked-down NPU opens up a lot of room for future work on weight patching or incremental compilation.
The repo (Objective-C runtime, Python used only for one-time weight conversion) is MIT licensed and available here:
https://github.com/mechramc/Orion
I would love to hear thoughts from the systems ML folks here on the constraint catalog, or ideas on how to tackle the compile-time weight bottleneck.
submitted by /u/No_Gap_4296