[P] On-device Qwen3-TTS (1.7B/0.6B) inference on iOS and macOS via MLX-Swift — voice cloning, voice design, and streaming TTS with no cloud

Hey r/MachineLearning. I’m a solo dev working on on-device TTS using MLX-Swift with Qwen3-TTS. 1.7B model on macOS, 0.6B on iOS, quantized to 5-bit to fit within mobile memory constraints. No cloud, everything runs locally. The app is called Speaklone.

Short demo video: https://www.youtube.com/watch?v=05gne9oPaaY

The most interesting technical challenge has been MLX’s lazy evaluation on memory-constrained devices. Computation graphs silently accumulate memory through strong references between arrays, and on iOS with a ~4GB jetsam ceiling, you hit the wall fast. Peak generation runs 2.7-3.5GB depending on mode, so there’s almost no headroom.
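To make the failure mode concrete, here's a minimal sketch of how a lazy graph grows across a generation loop and how forcing evaluation breaks it. This is MLX-Swift-flavored pseudocode: `zeros`/`ones`/`eval` are the free-function API as I understand it, so treat the exact names as approximate.

```swift
import MLX

// Each lazy op appends a graph node that holds strong references to
// its inputs, so a long autoregressive loop keeps every intermediate
// array alive until something forces evaluation.
var hidden = zeros([1, 1024])
for _ in 0..<1024 {
    hidden = hidden + ones([1, 1024])  // graph grows; nothing computed yet
}

// eval() materializes the result, after which the intermediate graph
// nodes can be released instead of pinning memory to the end of the run.
eval(hidden)
```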

What ended up working:

- a 512MB MLX cache limit and a 3.5GB memory ceiling
- converting results to native Swift types eagerly, per chunk, to break the computation graph
- clearing the cache aggressively between generations

Chunked decoding also lets audio stream while the model is still generating, which helps hide latency on slower devices.
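For anyone trying to reproduce this, here's a hedged sketch of the memory policy plus per-chunk streaming. The `GPU.set(cacheLimit:)`, `GPU.set(memoryLimit:)`, `GPU.clearCache()`, and `asArray` calls reflect MLX-Swift's GPU controls as I understand them (check your MLX version), and `playChunk` is a hypothetical playback hook, not a real API:

```swift
import MLX

// Keep MLX's buffer cache and total footprint under the iOS jetsam ceiling.
GPU.set(cacheLimit: 512 * 1024 * 1024)     // 512MB buffer cache
GPU.set(memoryLimit: 3_584 * 1024 * 1024)  // ~3.5GB ceiling

func streamAudio(chunks: [MLXArray], playChunk: ([Float]) -> Void) {
    for chunk in chunks {
        eval(chunk)                              // force computation for this chunk only
        let samples = chunk.asArray(Float.self)  // copy into native Swift, dropping the graph
        playChunk(samples)                       // audio plays while later chunks generate
    }
    GPU.clearCache()                             // return cached buffers between generations
}
```

The key design point is that each chunk's graph dies as soon as its samples are copied out, so peak memory scales with the chunk size, not with the length of the whole utterance.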

One choice I’ve become convinced is right for the platform: keeping the embeddings quantized as well as the weights. That’s unusual, but with careful tuning it’s the correct tradeoff when you’re fighting for every megabyte.
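Rough numbers on why the embedding table alone is worth quantizing — the vocab and hidden sizes below are my illustrative assumptions, not Qwen3-TTS's actual config:

```swift
// Illustrative: a ~152k-token vocab with a 2048-dim embedding table.
let vocab = 151_936
let dim = 2_048
let fp16Bytes = vocab * dim * 2      // fp16 table
let q5Bytes = vocab * dim * 5 / 8    // 5-bit packed, before per-group scales
print(fp16Bytes / (1 << 20))         // ~593 MiB
print(q5Bytes / (1 << 20))           // ~185 MiB
```

Even before activations, that's several hundred megabytes reclaimed from a single tensor on a device with a ~4GB ceiling.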

Voice cloning works from ~5-30s audio samples, and there’s a voice design mode where natural language descriptions (“warm female narrator, mid-30s”) guide generation without reference audio. Both run on the same pipeline.

It’s on the App Store if anyone wants to try it. Happy to go deeper on any of the MLX deployment stuff.

For those of you shipping products on top of open-weight models: how do you handle the expectation that it should all be free? The engineering to make this stable on a phone is months of work, but there’s always a contingent that sees open weights and assumes the product should be free too. Curious how others navigate that.

I’m also looking into contributing back to some relevant OSS projects. It’s not trivial since I made very different choices in my tech stack, but I think there are a few things that could be shared in a helpful way.

submitted by /u/SurvivalTechnothrill