It feels like LLM inference is missing its AWS Lambda moment.

If we actually wanted “model = function” to work, a few things seem fundamentally required:

• Fast scale from zero without keeping GPUs alive just to hold state
• Execution state reuse so models don’t need a full re-init and KV rebuild on every scale event (see the sketch after this list)
• Clear separation between orchestration and runtime, like Lambda vs the underlying compute
• Predictable latency even under spiky, bursty traffic
• A cost model that doesn’t assume always-on GPUs

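For concreteness, here’s a minimal sketch of what a “model = function” contract could look like if the runtime exposed explicit snapshot/restore hooks for execution state. This is hypothetical pseudocode in Python, not an existing API: `ModelFunction`, `ExecutionState`, `snapshot_state`, and `restore_state` are made-up names to illustrate the separation between the orchestrator and the runtime.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExecutionState:
    """Opaque bundle of reusable runtime state: a weights handle,
    allocator layout, and any warm KV-cache pages worth keeping."""
    weights_ref: str                       # reference to already-resident weights
    kv_pages: dict = field(default_factory=dict)
    created_at: float = field(default_factory=time.time)


class ModelFunction:
    """Hypothetical Lambda-style contract: invoke() is the only hot path,
    while cold_init/snapshot/restore let the platform scale to zero without
    paying a full re-init + KV rebuild on every scale event."""

    def __init__(self, model_id: str):
        self.model_id = model_id
        self.state: Optional[ExecutionState] = None

    def cold_init(self) -> None:
        # Full cold start: load weights, build graphs, allocate the KV pool.
        self.state = ExecutionState(weights_ref=f"gpu://{self.model_id}")

    def snapshot_state(self) -> ExecutionState:
        # The platform calls this before reclaiming the GPU (scale to zero).
        assert self.state is not None, "nothing to snapshot"
        return self.state

    def restore_state(self, snap: ExecutionState) -> None:
        # Warm start: reattach saved state instead of re-running cold_init().
        self.state = snap

    def invoke(self, prompt: str) -> str:
        # The actual inference call; stubbed out here.
        if self.state is None:
            self.cold_init()
        return f"[{self.model_id}] completion for: {prompt!r}"


# Orchestrator-side flow: serve, snapshot, scale to zero, resume on the next burst.
fn = ModelFunction("llama-70b")
print(fn.invoke("hello"))            # cold path
snap = fn.snapshot_state()           # taken before releasing the GPU
fn = ModelFunction("llama-70b")      # fresh replica when traffic returns
fn.restore_state(snap)               # warm path, no full re-init
print(fn.invoke("hello again"))
```

The hard part, of course, is what `snapshot_state` actually holds when the state is tens of GB of GPU-resident weights and KV pages, and where it lives while the GPU is reclaimed.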
Today, most inference setups still treat models as long-lived services, which makes scale-to-zero and elasticity awkward.

What’s the real hard blocker to a true Lambda-style abstraction for models? Cold starts, KV cache, GPU memory semantics, scheduling, or something else?
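One way to see why cold starts are a strong candidate: even ignoring container and runtime startup, just moving the weights onto the GPU has a bandwidth floor. A rough back-of-envelope, where the model size and bandwidth figures are illustrative assumptions rather than measurements:

```python
# Rough cold-start floor from data movement alone (all figures are
# illustrative assumptions, not measurements).
weights_gb = 140          # e.g. ~70B params at 2 bytes/param (FP16)
nvme_gbps = 7             # fast local NVMe read bandwidth
pcie_gbps = 25            # practical PCIe Gen4 x16 host-to-device bandwidth

disk_to_host_s = weights_gb / nvme_gbps    # ~20 s
host_to_gpu_s = weights_gb / pcie_gbps     # ~5.6 s

print(f"disk -> host: {disk_to_host_s:.1f} s")
print(f"host -> GPU:  {host_to_gpu_s:.1f} s")
# Even with weights pre-staged in host RAM, there are still several seconds
# before the first token, which is why "keep it warm" is the default today.
```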

submitted by /u/pmv143
