[D]Unpopular Opinion: With vLLM raising $150M, I think the industry is still optimizing for the wrong metric. “Throughput” is a solved problem; the real bottleneck is Cold Start Latency.

The news today that Inferact (vLLM) raised $150M at an $800M valuation is huge. It validates that “Inference Efficiency” is the most valuable problem in AI right now.

But looking at where that money and engineering effort are going (Continuous Batching, PagedAttention), I think we are hitting diminishing returns. Everyone is obsessing over Throughput (Tokens/Sec), when the real bottleneck has already shifted to Latency (Time-to-Load).

Here is my thesis on why we need to stop optimizing for throughput:

  1. The “Human Speed Limit” (Throughput is Solved)

We are already generating tokens faster than humans can read.

  • Average human reading speed: ~5–10 tokens/sec.
  • H100 w/ vLLM: 200+ tokens/sec.

We have effectively overshot the target for Chatbot UX. Making a model generate 500 t/s helps with batching costs, but it does nothing for the actual user experience.
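A rough back-of-envelope using the numbers above (the 500-token answer length is an assumption for illustration, not a benchmark):

```python
# Back-of-envelope: generation speed vs. human reading speed.
# Reading speed and H100+vLLM throughput come from the post;
# the 500-token answer length is an assumed figure.
READ_TPS = 7.5        # midpoint of ~5-10 tokens/sec
GEN_TPS = 200         # H100 w/ vLLM, per the post
ANSWER_TOKENS = 500   # assumed length of a long chat answer

time_to_generate = ANSWER_TOKENS / GEN_TPS   # ~2.5 s
time_to_read = ANSWER_TOKENS / READ_TPS      # ~67 s

print(f"generate: {time_to_generate:.1f}s | read: {time_to_read:.0f}s | "
      f"overshoot: {time_to_read / time_to_generate:.0f}x")
```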

  2. The “Agent Death Spiral” (Why Latency Matters)

The future isn’t “One Model, One Chat.” It’s “Chain of Thought” Agent swarms routing through specialized adapters (Planner → Coder → Reviewer → Summarizer).

  • The Problem: On standard serving stacks (K8s/vLLM), loading a new model/adapter takes 20s–40s (cold start).
  • The Result: A 4-step Agent chain takes ~2 minutes just to boot (rough arithmetic in the sketch below). This makes real-time multi-agent flows impossible in production.
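A quick sketch of that boot-time arithmetic, assuming the steps run sequentially and each pays a fresh model/adapter load; the hot-swap figure anticipates the ~1.2s claim later in the post:

```python
# Boot-time budget for a 4-step agent chain (Planner -> Coder -> Reviewer -> Summarizer).
# Cold-start range comes from the post; hot-swap time is the ~1.2s target claimed below.
STEPS = 4
COLD_START_S = (20.0, 40.0)   # per-model/adapter load on a standard K8s/vLLM stack
HOT_SWAP_S = 1.2              # hot-swap load target

boot_lo, boot_hi = (STEPS * s for s in COLD_START_S)
print(f"cold-start boot time: {boot_lo:.0f}-{boot_hi:.0f}s (~2 min at the midpoint)")
print(f"hot-swap boot time:   {STEPS * HOT_SWAP_S:.1f}s")
```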

  3. “Servers” vs. “Functions”

To make Agents viable, we need to stop treating LLMs like “Web Servers” (monolithic processes that stay on) and start treating them like “Serverless Functions” (instantly swappable).

We’ve been building a custom engine that ignores throughput optimization and focuses entirely on PCIe Saturation and CUDA Graph Replay. By bypassing the Python/OS overhead, we managed to get H100 cold starts down to ~1.2s.
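For readers unfamiliar with those two ingredients, here is a minimal PyTorch sketch of the general ideas (pinned-memory staging to keep PCIe busy, plus CUDA Graph replay to skip Python dispatch). It is not the author's engine; the shapes and the toy matmul "model" are assumptions, and a real 70B checkpoint would be streamed shard by shard.

```python
# Minimal sketch, NOT the actual engine described above.
# (1) Stage weights in pinned host memory so host->device copies run near PCIe line rate.
# (2) Capture the forward pass as a CUDA graph so serving replays prerecorded kernel
#     launches instead of paying Python/launch overhead. Shapes and model are toy assumptions.
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

# 1) Checkpoint kept resident in pinned (page-locked) CPU memory ahead of time.
cpu_weights = torch.empty(4096, 4096, dtype=torch.float16).pin_memory()

# "Cold start": async DMA copy into a preallocated GPU buffer on a dedicated stream.
gpu_weights = torch.empty_like(cpu_weights, device=device)
copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    gpu_weights.copy_(cpu_weights, non_blocking=True)
torch.cuda.current_stream().wait_stream(copy_stream)

# 2) Warm up on a side stream, then capture the forward pass once as a CUDA graph.
static_input = torch.zeros(1, 4096, dtype=torch.float16, device=device)
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        _ = static_input @ gpu_weights
torch.cuda.current_stream().wait_stream(side)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = static_input @ gpu_weights

# Per request: overwrite the static input buffer in place and replay the captured kernels.
static_input.copy_(torch.randn_like(static_input))
graph.replay()
print(static_output.shape)
```

The design point the sketch tries to illustrate: nothing on the request path allocates memory or walks the Python interpreter beyond a single `replay()` call, which is what makes the load-then-serve cycle fast enough to treat models as swappable.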

My Take:

vLLM is building the “Linux” for the Chatbot era (Compatibility & Throughput).

But for the Agent era, Latency is the new Throughput. If you can’t hot-swap a 70B model in <2 seconds, you can’t serve Agents efficiently.

submitted by /u/pmv143