Multi-Objective Scheduling for Large Language Model Inference with Prompt-Level Cost Prediction and SLO Awareness

Large language model (LLM) inference in multi-tenant clouds is becoming an increasingly important contributor to data-center carbon emissions, yet existing carbon-aware scheduling techniques target long-running training jobs and are ill-suited for the short, bursty, SLO-sensitive nature of online serving. We propose CAPS (Carbon-Aware Prompt Scheduling), an online bi-objective scheduler that jointly optimizes goodput and per-request carbon cost for multi-tenant LLM inference. CAPS first employs a lightweight prompt complexity predictor to estimate token generation cost and latency risk for each incoming request. It then combines real-time grid carbon intensity, GPU energy profiles, and per-tenant SLO tiers to route each request to one of three execution pools: a low-latency pool, a low-carbon pool, or a delay-tolerant batch pool. A composite reward function balances goodput, carbon emissions, and SLO violation rate. In trace-driven simulations using public conversation traces and regional carbon intensity data, CAPS reduces average carbon emissions per 1K generated tokens by 26.8% compared to round-robin scheduling while achieving an SLO attainment rate that matches or exceeds a dedicated SLO-aware baseline.
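The pool-routing idea described above can be sketched as follows. This is an illustrative toy model, not the authors' implementation: the pool parameters, the linear per-token latency and energy model, and the weighted cost function (a stand-in for the composite reward) are all hypothetical assumptions chosen to show the trade-off, with each request routed to the pool minimizing a weighted sum of SLO-violation risk and estimated carbon.

```python
# Illustrative sketch (not CAPS itself) of routing a request to one of three
# execution pools by minimizing a composite cost over SLO risk and carbon.
# All pool parameters, weights, and the linear cost model are hypothetical.
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    latency_per_token_ms: float  # assumed per-token decode latency
    energy_per_token_j: float    # assumed per-token GPU energy


def route(predicted_tokens: int,
          slo_ms: float,
          carbon_intensity_g_per_kwh: float,
          pools: list[Pool],
          alpha: float = 1.0,       # weight on SLO-violation risk
          beta: float = 1.0) -> Pool:  # weight on carbon cost
    """Pick the pool minimizing alpha * slo_risk + beta * carbon_grams."""
    def cost(p: Pool) -> float:
        latency_ms = predicted_tokens * p.latency_per_token_ms
        slo_risk = max(0.0, latency_ms - slo_ms) / slo_ms  # fraction over SLO
        energy_kwh = predicted_tokens * p.energy_per_token_j / 3.6e6
        carbon_g = energy_kwh * carbon_intensity_g_per_kwh
        return alpha * slo_risk + beta * carbon_g
    return min(pools, key=cost)


pools = [
    Pool("low-latency", latency_per_token_ms=10, energy_per_token_j=0.8),
    Pool("low-carbon", latency_per_token_ms=25, energy_per_token_j=0.3),
    Pool("batch", latency_per_token_ms=60, energy_per_token_j=0.2),
]

# A tight SLO pushes the request to the fast pool; a relaxed SLO
# lets the scheduler favor the lower-carbon pools instead.
print(route(200, slo_ms=2500, carbon_intensity_g_per_kwh=400,
            pools=pools).name)   # → low-latency
print(route(200, slo_ms=60000, carbon_intensity_g_per_kwh=400,
            pools=pools).name)   # → batch
```

Under this toy model, the same request lands in a different pool purely as a function of its SLO tier, which is the behavior the scheduler's third, delay-tolerant pool exists to exploit.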
