[D] – 1M tokens/second serving Qwen 3.5 27B on B200 GPUs, benchmark results and findings

Wrote up the process of pushing Qwen 3.5 27B (dense, FP8) to 1.1M total tok/s on 96 B200 GPUs with vLLM v0.18.0.

  • DP=8 nearly 4x’d throughput over TP=8; the model is too small for tensor parallelism to pay off on B200s (config sketch after this list).
  • MTP-1 mattered more than anything else (GPU utilization was 0% without it); MTP-5 crashed with cudaErrorIllegalAddress.
  • 97.1% scaling efficiency at 8 nodes, 96.5% at 12 (arithmetic below); TPOT stayed flat at ~46ms regardless of node count.
  • Inference Gateway (KV-cache-aware routing) added ~35% overhead vs ClusterIP round-robin; the single EPP pod is the bottleneck.

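For reference, a minimal sketch of the DP + MTP-1 setup using vLLM's offline Python API (the post served via GKE, so the actual deployment presumably used `vllm serve`; parameter names follow recent vLLM APIs, and the model id and MTP method string are assumptions):

```python
from vllm import LLM

# DP over TP for a small dense model: 8 independent engine replicas,
# each holding a full copy of the weights on one B200.
llm = LLM(
    model="Qwen/Qwen3.5-27B-FP8",        # hypothetical HF repo id
    tensor_parallel_size=1,              # TP off: a 27B FP8 model fits on one GPU
    data_parallel_size=8,                # DP=8, one replica per GPU
    speculative_config={
        "method": "mtp",                 # assumed method name for the MTP head
        "num_speculative_tokens": 1,     # MTP-1 (MTP-5 crashed per the post)
    },
)

print(llm.generate("Say hi.")[0].outputs[0].text)
```
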
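Scaling efficiency here is measured throughput at N nodes divided by N times the single-node throughput. A quick check in Python (the single-node figure is back-solved from the reported 96.5% at 12 nodes, not taken from the post):

```python
def scaling_efficiency(tput_n: float, tput_1: float, nodes: int) -> float:
    """Measured N-node throughput vs. perfect linear scaling of one node."""
    return tput_n / (nodes * tput_1)

tput_1 = 95_000        # tok/s for 1 node (8 GPUs), back-solved assumption
tput_12 = 1_100_000    # ~1.1M total tok/s at 12 nodes (96 GPUs), from the post

print(f"{scaling_efficiency(tput_12, tput_1, 12):.1%}")  # -> 96.5%
```
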
Benchmarked per the InferenceMAX methodology: input-len=1024, output-len=512, 0% prefix-cache hit rate, so these are worst-case numbers.
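
A toy TPOT probe in the same spirit (streaming completions, fixed output length, a fresh prompt each run so nothing hits the prefix cache). This is not the InferenceMAX harness; the endpoint, model name, and the crude "one token per word" prompt sizing are all assumptions:

```python
import asyncio
import time

from openai import AsyncOpenAI

# Any OpenAI-compatible vLLM endpoint works here; URL/model are placeholders.
client = AsyncOpenAI(base_url="http://gateway.local/v1", api_key="unused")

async def tpot_ms(prompt: str, max_tokens: int = 512) -> float:
    """Mean inter-token gap in ms, excluding time-to-first-token."""
    stamps: list[float] = []
    stream = await client.completions.create(
        model="qwen3.5-27b-fp8",
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True,
    )
    async for _ in stream:                 # roughly one chunk per generated token
        stamps.append(time.perf_counter())
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    return 1000 * sum(gaps) / max(1, len(gaps))

# Roughly a 1024-token prompt; unique each run so 0% prefix-cache hits.
print(asyncio.run(tpot_ms(f"{time.time()} " + "token " * 1024)))
```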

https://medium.com/google-cloud/1-million-tokens-per-second-qwen-3-5-27b-on-gke-with-b200-gpus-161da5c1b592

disclosure: I work for Google Cloud.

submitted by /u/m4r1k_
