[P] On-device speech toolkit for Apple Silicon — ASR, TTS, diarization, speech-to-speech, all in native Swift

Open-source Swift package that runs 11 speech models on Apple Silicon via MLX (GPU) and CoreML (Neural Engine). All inference is fully local, with no cloud dependency.

Models implemented:

- ASR – Qwen3-ASR 0.6B/1.7B (4-bit), Parakeet TDT (CoreML INT4) – RTF ~0.06 on M2 Max
- TTS – Qwen3-TTS 0.6B (4-bit), CosyVoice3 0.5B (4-bit) – streaming, ~120 ms to first chunk
- Speech-to-speech – PersonaPlex 7B (4-bit) – full-duplex, RTF ~0.87
- VAD – Silero v5, Pyannote segmentation-3.0 – streaming + overlap detection
- Diarization – Pyannote + WeSpeaker + spectral clustering – automatic speaker count via GMM-BIC
- Enhancement – DeepFilterNet3 (CoreML) – real-time 48 kHz noise suppression
- Alignment – Qwen3-ForcedAligner – non-autoregressive, RTF ~0.018

Key design choice: MLX runs the large models on the GPU, while CoreML runs the small models on the Neural Engine. This lets VAD run on the ANE while ASR runs on the GPU without contention, something WhisperKit struggles with (its Core ML audio encoder blocks the ANE for 300-600 ms per call).
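The CoreML side of that split can be pinned explicitly. A minimal sketch, assuming a compiled VAD model bundle (the model path is a placeholder; `MLModelConfiguration` and `computeUnits` are standard CoreML API):

```swift
import CoreML

// Restrict this model to CPU + Neural Engine so it never lands on the GPU,
// which stays free for MLX-backed inference running concurrently.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// Placeholder path to a compiled CoreML model bundle.
let vadURL = URL(fileURLWithPath: "SileroVAD.mlmodelc")
let vad = try MLModel(contentsOf: vadURL, configuration: config)
// The MLX ASR model runs on the GPU in a separate task, so the two
// workloads never compete for the same compute unit.
```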

All models conform to shared protocols, so you can swap implementations or compose pipelines. Currently working on a MeetingTranscriber pipeline (diarize → per-segment ASR) and streaming real-time diarization.
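The protocol names below are illustrative assumptions, not the package's actual API, but a diarize-then-transcribe pipeline composed from shared protocols could look roughly like this:

```swift
// Hypothetical protocol shapes; the package's real protocols may differ.
protocol SpeechTranscriber {
    func transcribe(_ samples: [Float]) async throws -> String
}

protocol SpeakerDiarizer {
    // Returns (speaker ID, sample range) segments in playback order.
    func diarize(_ samples: [Float]) async throws -> [(speaker: Int, range: Range<Int>)]
}

// A diarize → per-segment ASR pipeline along the lines of the
// MeetingTranscriber mentioned above, composed from the two protocols.
struct MeetingTranscriber {
    let diarizer: any SpeakerDiarizer
    let asr: any SpeechTranscriber

    func transcribe(_ samples: [Float]) async throws -> [(speaker: Int, text: String)] {
        var out: [(speaker: Int, text: String)] = []
        for segment in try await diarizer.diarize(samples) {
            let text = try await asr.transcribe(Array(samples[segment.range]))
            out.append((segment.speaker, text))
        }
        return out
    }
}
```

Because both fields are existentials (`any ...`), any conforming implementation can be swapped in without changing the pipeline.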

Roadmap: https://github.com/ivan-digital/qwen3-asr-swift/discussions/81

Repo: https://github.com/ivan-digital/qwen3-asr-swift

submitted by /u/ivan_digital