It feels like LLM inference is missing its AWS Lambda moment.

If we actually wanted “model = function” to work, a few things seem fundamentally required:

• Fast scale from zero without keeping GPUs alive just to hold state
• Execution state reuse so models don’t need a full re-init and KV rebuild on every scale event (see the sketch after this list)
• Clear separation between orchestration and runtime, like Lambda vs the underlying compute
• Predictable latency even under spiky, bursty traffic
• A cost model that doesn’t assume always-on GPUs

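For concreteness, here’s a minimal sketch of what a “model = function” contract could look like if the runtime exposed explicit snapshot/restore hooks for execution state. This is hypothetical pseudocode in Python, not an existing API: `ModelFunction`, `ExecutionState`, `snapshot_state`, and `restore_state` are made-up names to illustrate the separation between the orchestrator and the runtime.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExecutionState:
    """Opaque bundle of reusable runtime state: a weights handle,
    allocator layout, and any warm KV-cache pages worth keeping."""
    weights_ref: str                       # reference to already-resident weights
    kv_pages: dict = field(default_factory=dict)
    created_at: float = field(default_factory=time.time)


class ModelFunction:
    """Hypothetical Lambda-style contract: invoke() is the only hot path,
    while cold_init/snapshot/restore let the platform scale to zero without
    paying a full re-init + KV rebuild on every scale event."""

    def __init__(self, model_id: str):
        self.model_id = model_id
        self.state: Optional[ExecutionState] = None

    def cold_init(self) -> None:
        # Full cold start: load weights, build graphs, allocate the KV pool.
        self.state = ExecutionState(weights_ref=f"gpu://{self.model_id}")

    def snapshot_state(self) -> ExecutionState:
        # The platform calls this before reclaiming the GPU (scale to zero).
        assert self.state is not None, "nothing to snapshot"
        return self.state

    def restore_state(self, snap: ExecutionState) -> None:
        # Warm start: reattach saved state instead of re-running cold_init().
        self.state = snap

    def invoke(self, prompt: str) -> str:
        # The actual inference call; stubbed out here.
        if self.state is None:
            self.cold_init()
        return f"[{self.model_id}] completion for: {prompt!r}"


# Orchestrator-side flow: serve, snapshot, scale to zero, resume on the next burst.
fn = ModelFunction("llama-70b")
print(fn.invoke("hello"))            # cold path
snap = fn.snapshot_state()           # taken before releasing the GPU
fn = ModelFunction("llama-70b")      # fresh replica when traffic returns
fn.restore_state(snap)               # warm path, no full re-init
print(fn.invoke("hello again"))
```

The hard part, of course, is what `snapshot_state` actually holds when the state is tens of GB of GPU-resident weights and KV pages, and where it lives while the GPU is reclaimed.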
Today, most inference setups still treat models as long-lived services, which makes scale-to-zero and elasticity awkward.

What’s the real hard blocker to a true Lambda-style abstraction for models? Cold starts, KV cache, GPU memory semantics, scheduling, or something else?
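One way to see why cold starts are a strong candidate: even ignoring container and runtime startup, just moving the weights onto the GPU has a bandwidth floor. A rough back-of-envelope, where the model size and bandwidth figures are illustrative assumptions rather than measurements:

```python
# Rough cold-start floor from data movement alone (all figures are
# illustrative assumptions, not measurements).
weights_gb = 140          # e.g. ~70B params at 2 bytes/param (FP16)
nvme_gbps = 7             # fast local NVMe read bandwidth
pcie_gbps = 25            # practical PCIe Gen4 x16 host-to-device bandwidth

disk_to_host_s = weights_gb / nvme_gbps    # ~20 s
host_to_gpu_s = weights_gb / pcie_gbps     # ~5.6 s

print(f"disk -> host: {disk_to_host_s:.1f} s")
print(f"host -> GPU:  {host_to_gpu_s:.1f} s")
# Even with weights pre-staged in host RAM, there are still several seconds
# before the first token, which is why "keep it warm" is the default today.
```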

submitted by /u/pmv143
