I Watched Our AI Pipeline Silently Fail While Kubernetes Said Everything Was Fine

The dashboards were green. The pods were healthy. Our users were timing out. Here’s what I learned.


Black Friday simulation. My team is running load tests on our retail AI pipeline — the one that enriches every product page with real-time personalization and LLM-generated descriptions. Traffic ramps up. We’re watching the dashboards.

CPU: 55%. Comfortable.

Pods: healthy. All green checkmarks.

Request latency: climbing. 2x. 4x. Users hitting timeouts.

And Kubernetes? Kubernetes thought everything was fine.

It took us a while to understand what was actually happening — and once we did, it changed how I think about autoscaling AI systems entirely.


The Architecture That Fooled Us

Our setup was straightforward. Every incoming API request triggered three things:

  1. An LLM call to enrich product data

  2. A real-time personalization scoring pass

  3. Response aggregation and delivery

   API Gateway → Service Layer → AI Enrichment Service → Response


We deployed it with a standard Horizontal Pod Autoscaler targeting 70% CPU utilization. This is the default playbook. It works fine for conventional microservices.
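
For reference, that default playbook is roughly the manifest below. This is a minimal sketch rather than our exact config; the HPA name is illustrative, and the deployment it targets is the one shown later in the KEDA section:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
  namespace: ai-services
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out once average CPU crosses 70%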

Here’s what happened under load:

  • Incoming request rate climbed steadily (simulating a flash sale event)
  • AI inference requests started queuing behind GPU workers
  • Response latency spiked from ~200ms to over 4 seconds
  • Queue depth grew nearly 10x from baseline

CPU? Held steady at 55%. Never breached our threshold.

Kubernetes didn’t scale a single pod.


Why CPU Is the Wrong Signal for AI Systems

Traditional autoscaling is built on one core assumption: resource utilization reflects system load.

For a typical stateless API — request comes in, gets processed synchronously, response goes out — this assumption mostly holds. When things get busy, CPU goes up, the autoscaler fires, new pods appear.

AI workloads break this assumption at every layer.

GPU-bound inference: The actual work happens on the GPU. CPU is largely an orchestration layer — it manages queues, shuffles data, handles I/O. You can be completely saturated at the inference level while CPU lounges around at 40%.

Queue-driven execution: LLM requests don’t get processed instantly. They wait for available workers. That wait is invisible to standard metrics unless you’re explicitly watching queue depth.

Token-based latency variability: Response time for a language model depends on how many tokens it generates. A product description might take 200ms. A detailed review summary might take 3 seconds. CPU sees the same load for both; your user experience does not.

Long-lived requests: A single LLM request can hold a connection for multiple seconds. Thread counts, connection pools, and GPU memory all become bottlenecks well before CPU does.

The result: your system can be failing in ways CPU will never show you.


The Signal That Actually Told the Truth

Once we instrumented deeper — specifically, adding Prometheus metrics around queue depth and per-request latency — the story became obvious.

At peak load in our retail simulation:

| Metric | Baseline | Peak Load |
|---|---|---|
| CPU utilization | 38% | 55% |
| GPU utilization | 62% | 98% |
| Queue depth | ~5 | ~48 |
| P95 latency | 210ms | 4,100ms |

CPU told us everything was manageable. Queue depth told us we were collapsing.

The queue is where demand becomes visible before it becomes a user-facing problem. When queue depth climbs, it means your workers can’t keep up with incoming requests. That’s the moment to scale — not after latency has already spiked and users are seeing timeouts.
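
A side note on the instrumentation itself: one way to keep that P95 signal cheap and reusable is to pre-compute it with a recording rule. A minimal sketch, assuming the prometheus-operator PrometheusRule CRD; the expression is the same histogram query we later hand to KEDA, and the rule and group names are made up:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-inference-signals
  namespace: ai-services
spec:
  groups:
    - name: ai-inference.rules
      rules:
        # Pre-computed P95 inference latency over a 2-minute window
        - record: inference:latency_p95:2m
          expr: |
            histogram_quantile(0.95,
              sum(rate(inference_latency_bucket[2m])) by (le))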


Implementing Queue-Driven Autoscaling With KEDA

KEDA (Kubernetes Event-Driven Autoscaling) is an open-source project that extends Kubernetes autoscaling to support external event sources. Instead of watching CPU, it can scale based on queue length, Prometheus metrics, message broker depth, database query results — essentially any signal you can expose.

Here's the ScaledObject configuration we deployed for our AI inference service:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-inference-scaler
  namespace: ai-services
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-deployment
  pollingInterval: 30        # check trigger sources every 30 seconds
  cooldownPeriod: 120
  minReplicaCount: 2         # never drop below two warm pods
  maxReplicaCount: 20        # hard ceiling on scale-out

  triggers:
    # Trigger 1: queue depth (early-warning signal)
    - type: redis
      metadata:
        host: redis
        port: "6379"
        listName: ai_request_queue
        listLength: "50"     # target queue items per replica

    # Trigger 2: P95 latency (SLO guardrail)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_latency_p95
        threshold: "2000"    # target P95 latency in milliseconds
        query: |
          histogram_quantile(0.95,
            sum(rate(inference_latency_bucket[2m])) by (le))

Two triggers, working together:

Queue depth (Redis): This is your early warning system. When requests start piling up faster than workers can process them, the queue grows. KEDA catches this before users feel it. In our retail scenario, this was the primary trigger during ramp-up — requests were queuing before latency even started climbing.

P95 latency (Prometheus): This is your SLO guardrail. Even if the queue hasn’t grown dramatically, if individual requests are taking too long, something is wrong. A single slow model, a cold-start penalty, a downstream bottleneck — latency captures all of these. This trigger saved us during a model warmup event that the queue signal missed entirely.

Neither trigger alone would have been sufficient. Together, they capture what CPU never could.


What Changed in Production (After Tuning)

After deploying KEDA with these dual triggers, we re-ran the same Black Friday simulation:

Before KEDA:

  • Kubernetes observed no scaling need
  • Queue grew to 48+ items
  • P95 latency: 4,100ms
  • User-facing timeout rate: ~12%

After KEDA:

  • Pods began scaling at queue depth ~50 (before user impact)
  • Peak pod count reached 11 (from 3)
  • P95 latency held under 350ms throughout
  • Timeout rate: effectively 0%

The system didn’t just perform better. It became predictable — which, for retail workloads that have to survive flash sales, is arguably more valuable than peak performance.


Three Patterns That Actually Work

If you’re running AI workloads on Kubernetes, here are the scaling patterns worth your time:

1. Queue depth as the primary trigger

Pick your queue (Redis, SQS, RabbitMQ, Kafka — KEDA supports all of them). Set a threshold that represents “workers are falling behind.” Scale proactively, before users notice.
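
If your queue lives in SQS instead of Redis, for example, only the triggers block changes. A hedged sketch: the queue URL and account ID are placeholders, and it assumes the usual KEDA AWS authentication setup (pod identity or a TriggerAuthentication):

  triggers:
    - type: aws-sqs-queue
      metadata:
        # Placeholder queue URL; account ID and region are illustrative
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/ai_request_queue
        queueLength: "50"      # target messages per replica, mirroring listLength above
        awsRegion: "us-east-1"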

2. Latency as the secondary guardrail

Expose P95 or P99 latency through Prometheus. Set thresholds based on your SLOs. This catches degradation patterns the queue won’t surface — cold starts, model loading delays, downstream slowdowns.

3. GPU utilization for inference-heavy deployments

If you’re running dedicated GPU nodes, expose GPU utilization via DCGM Exporter and use it as a KEDA Prometheus trigger. This is especially valuable for batch inference workloads where individual request latency is less meaningful than worker saturation.
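
Here's roughly what that could look like, assuming dcgm-exporter is already being scraped by Prometheus. DCGM_FI_DEV_GPU_UTIL is the exporter's GPU utilization gauge; the 85% threshold and the averaging across all GPUs are illustrative choices, not values from our setup:

  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: gpu_utilization
        threshold: "85"        # illustrative target: average GPU utilization percent
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL)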


The Anti-Patterns to Stop Doing

Cranking up CPU limits doesn’t help when CPU isn’t the bottleneck. I’ve seen teams double their CPU allocations and watch performance stay flat because the GPU was the constraint the entire time.

Tuning HPA thresholds lower (e.g., scaling at 40% CPU instead of 70%) causes over-provisioning and higher costs without actually solving the root signal problem.

Adding more nodes addresses capacity but not signal. If you’re scaling based on the wrong metric, more nodes just means you’re wrong at larger scale.

The fix isn’t more resources. It’s watching the right signals.


Where This Matters Most Right Now

If you’re building any of the following, you need to think about this:

  • LLM inference services — token generation is time- and resource-intensive in ways CPU won't capture
  • RAG pipelines — retrieval + generation chains have multi-stage queuing that compounds
  • Real-time personalization — burst traffic patterns (think retail events, news spikes) create sudden queue depth growth
  • AI-powered APIs at scale — any latency SLO under 1 second is extremely vulnerable to queue buildup

The common thread: demand arrives faster than workers can process it, and CPU doesn’t know.


Final Thought

Kubernetes is doing exactly what it was designed to do. It watches the metrics it knows about and scales accordingly. The problem is that we’ve connected AI workloads — which behave fundamentally differently from stateless APIs — to a system built around a signal (CPU) that those workloads don’t express well.

The fix isn’t complex. It’s instrumenting the right things, exposing them through Prometheus, and letting KEDA act on them. Once you make that shift — from scaling based on resource usage to scaling based on system intent — you stop watching dashboards that lie to you.

Start with the queue. That’s where the truth lives.

