[P] vLLM-MLX: Native Apple Silicon LLM inference – 464 tok/s on M4 Max
Hey everyone!
I built vLLM-MLX – an LLM inference framework that runs natively on Apple Silicon, using Apple's MLX for GPU acceleration.
What it does:
– OpenAI-compatible API (drop-in replacement for your existing code)
– Multimodal support: Text, Images, Video, Audio – all in one server
– Continuous batching for concurrent users (3.4x speedup)
– TTS in 10+ languages (Kokoro, Chatterbox models) – see the sketch after this list
– MCP tool calling support
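Since the server speaks the OpenAI API, TTS can be driven through the OpenAI SDK's audio endpoint. The snippet below is a minimal sketch, not taken from the project docs: the port, the /v1 path, the "kokoro" model id, and the voice name are all assumptions.

```python
from openai import OpenAI

# Assumption: vLLM-MLX is serving on localhost:8000 with an OpenAI-style /v1 prefix.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

speech = client.audio.speech.create(
    model="kokoro",      # placeholder TTS model id
    voice="af_heart",    # placeholder voice name
    input="Hello from Apple Silicon!",
)
speech.write_to_file("hello.wav")  # save the returned audio bytes to disk
```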
Performance on M4 Max:
– Llama-3.2-1B-4bit → 464 tok/s
– Qwen3-0.6B → 402 tok/s
– Whisper STT → 197x real-time
It works with the standard OpenAI Python SDK – just point the client at localhost.
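A minimal sketch of that, assuming the server is on port 8000 with a /v1 path and a quantized Llama model loaded (the model id below is a placeholder, not a documented default):

```python
from openai import OpenAI

# Point the stock OpenAI client at the local vLLM-MLX server.
# Port 8000, the /v1 path, and the model id are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",  # placeholder model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```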
submitted by /u/waybarrios