[P] vLLM-MLX: Native Apple Silicon LLM inference – 464 tok/s on M4 Max
Hey everyone!
I built vLLM-MLX – an LLM inference framework that runs natively on Apple Silicon, using Apple's MLX for GPU acceleration.
What it does:
– OpenAI-compatible API (drop-in replacement for your existing code)
– Multimodal support: Text, Images, Video, Audio – all in one server
– Continuous batching for concurrent users (3.4x speedup)
– TTS in 10+ languages (Kokoro, Chatterbox models) – see the sketch after this list
– MCP tool calling support
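Since the server speaks the OpenAI API, TTS can be driven through the OpenAI SDK's audio endpoint. The snippet below is a minimal sketch, not taken from the project docs: the port, the /v1 path, the "kokoro" model id, and the voice name are all assumptions.

```python
from openai import OpenAI

# Assumption: vLLM-MLX is serving on localhost:8000 with an OpenAI-style /v1 prefix.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

speech = client.audio.speech.create(
    model="kokoro",      # placeholder TTS model id
    voice="af_heart",    # placeholder voice name
    input="Hello from Apple Silicon!",
)
speech.write_to_file("hello.wav")  # save the returned audio bytes to disk
```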
Performance on M4 Max:
– Llama-3.2-1B-4bit → 464 tok/s
– Qwen3-0.6B → 402 tok/s
– Whisper STT → 197x real-time
It works with the standard OpenAI Python SDK – just point the client at localhost.
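A minimal sketch of that, assuming the server is on port 8000 with a /v1 path and a quantized Llama model loaded (the model id below is a placeholder, not a documented default):

```python
from openai import OpenAI

# Point the stock OpenAI client at the local vLLM-MLX server.
# Port 8000, the /v1 path, and the model id are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",  # placeholder model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```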
submitted by /u/waybarrios