Stabilizing Cloud Elastic Scaling with Risk-Constrained Reinforcement Learning Under Workload Drift
Elastic scaling in cloud-native environments is essential for maintaining service quality and resource efficiency. In practice, frequent traffic bursts and shifts in workload distributions render rule-based methods and single-objective optimization approaches insufficient: they struggle to ensure system stability and decision reliability at the same time. To address this challenge, this study formulates elastic scaling as a risk-constrained reinforcement learning problem from a sequential decision-making perspective, using a unified framework to model resource adjustment actions, system state evolution, and potential instability costs. By explicitly incorporating risk constraints into policy optimization, the proposed approach achieves a dynamic balance between performance optimization and safety control, preventing service-level objective (SLO) violations and system oscillations caused by overly aggressive scaling decisions. Resource utilization efficiency and service response behavior are considered jointly, which improves decision consistency and controllability in complex cloud environments. Comparative evaluation on real cloud cluster traces shows advantages over existing baselines in service reliability, response performance, and resource usage, confirming the effectiveness of risk-aware decision-making under non-stationary workloads. This work provides a systematic modeling approach for the safe application of reinforcement learning in cloud resource management and lays a methodological foundation for stable and efficient intelligent elastic scaling.
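A natural reading of the risk-constrained formulation described above is a constrained Markov decision process (CMDP). The following sketch is illustrative only: the symbols s_t (system state, e.g. utilization and latency metrics), a_t (scaling action), r (performance reward), c (instability cost such as an SLO-violation or oscillation penalty), and d (risk budget) are introduced here for exposition and are not defined in the abstract itself.

\[
\max_{\pi}\ \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\, c(s_t, a_t)\Big] \le d .
\]

Under this reading, one standard way to perform the risk-constrained policy optimization is a Lagrangian relaxation, \(\max_{\pi}\min_{\lambda \ge 0}\ \mathbb{E}_{\pi}\big[\sum_{t}\gamma^{t}(r - \lambda c)\big] + \lambda d\), in which the multiplier \(\lambda\) adaptively trades off performance against the instability cost; whether the paper uses this particular relaxation is not stated in the abstract.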