PPO agent to do load balancing + autoscaling for a Docker cluster (honest writeup + code)

digitado ⋅ July 2, 2026

I built a system where a single PPO agent simultaneously handles L7 load balancing and horizontal autoscaling for a Docker-based microservice cluster, instead of the usual combo of Round Robin routing plus static CPU thresholds.

Setup

The agent observes per container CPU, RAM, latency, error rate and queue depth, plus a global workload signal, and outputs both continuous routing weights and a scale up/down/hold decision every step.
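As a rough sketch of how a combined action like that could be decoded, assuming a flat vector of N routing logits followed by one scaling signal (this layout is inferred from the 26 dimensional action space at N=25, not taken from the repo; the function name, the softmax step and the 0.33 dead band are all hypothetical):

```python
import math

def decode_action(raw_action, n_containers):
    """Split a flat action vector into routing weights and a scaling decision.

    Assumed layout (not confirmed by the post): the first n_containers entries
    are routing logits, the last entry is a scaling signal in [-1, 1].
    """
    if len(raw_action) != n_containers + 1:
        raise ValueError("expected n_containers + 1 action dimensions")
    logits = raw_action[:n_containers]
    scale_signal = raw_action[n_containers]

    # Softmax turns unconstrained logits into weights that sum to 1.
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Hypothetical dead band: small signals are treated as "hold".
    if scale_signal > 0.33:
        decision = "up"
    elif scale_signal < -0.33:
        decision = "down"
    else:
        decision = "hold"
    return weights, decision
```

The weights would then be pushed to the load balancer as per-backend weights, and the decision applied to the replica count.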

Training happens in two phases. Phase 1 pretrains on a mathematical M/M/1 queueing simulation (fast, no Docker needed). Phase 2 fine-tunes on a real cluster with Docker, HAProxy for routing, and Locust generating traffic.
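For reference, the standard steady-state M/M/1 formulas that a Phase 1 style simulator could be built on look like this. It is a minimal sketch, not the repo's simulator: the function names are made up, and treating each container as an independent M/M/1 queue fed by its share of the traffic is an assumption.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state M/M/1 metrics for one container.

    arrival_rate (lambda) and service_rate (mu) are in requests per second.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_rate > 0")
    rho = arrival_rate / service_rate  # utilization
    if rho >= 1.0:
        # Unstable queue: the backlog grows without bound.
        return {"utilization": rho,
                "mean_latency": float("inf"),
                "mean_in_system": float("inf")}
    return {
        "utilization": rho,
        "mean_latency": 1.0 / (service_rate - arrival_rate),  # W = 1 / (mu - lambda)
        "mean_in_system": rho / (1.0 - rho),                  # L = rho / (1 - rho)
    }

def cluster_metrics(total_rate, weights, service_rate):
    """Split total traffic by routing weight and evaluate each container."""
    return [mm1_metrics(total_rate * w, service_rate) for w in weights]
```

With a service rate of 10 req/s, a container receiving 8 req/s sits at 80 percent utilization with a mean latency of 0.5 s, which shows how quickly latency climbs as routing pushes a node toward saturation.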

Evaluation

I benchmarked the trained policy against two baselines across five cluster sizes (N = 5, 10, 15, 20, 25), in both the simulated environment and the real Docker cluster:

A static CPU threshold scaler with Round Robin routing (the common production default)
A PID controller regulating CPU to a 60 percent setpoint, also with Round Robin
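The two baselines can be sketched roughly as follows. Only the 60 percent setpoint and the use of Round Robin come from the description above; the thresholds, PID gains, fleet bounds and all names here are hypothetical placeholders.

```python
from itertools import cycle

def threshold_scale(mean_cpu, n_active, high=80.0, low=30.0, n_min=1, n_max=25):
    """Static threshold rule: add one container above `high`, remove one below `low`."""
    if mean_cpu > high:
        n_active += 1
    elif mean_cpu < low:
        n_active -= 1
    return max(n_min, min(n_max, n_active))

class PIDScaler:
    """PID loop that nudges the container count to hold mean CPU at a setpoint."""

    def __init__(self, kp=0.1, ki=0.01, kd=0.05, setpoint=60.0, n_min=1, n_max=25):
        self.kp, self.ki, self.kd = kp, ki, kd  # gains are illustrative, not tuned
        self.setpoint = setpoint
        self.n_min, self.n_max = n_min, n_max
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, mean_cpu, n_active):
        """Return the new container count for this control step."""
        error = mean_cpu - self.setpoint  # positive when the cluster runs hot
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(self.n_min, min(self.n_max, n_active + round(output)))

def round_robin(container_ids):
    """Endless iterator that hands out containers in fixed rotation."""
    return cycle(container_ids)
```

Both scalers only see mean CPU, and routing is blind to per-container state, which is the main structural difference from the PPO agent.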

Results, the short version

The PID and threshold baselines actually beat PPO on cost efficiency (users served per active container) in most cluster sizes, both simulated and real. PPO does generalize across cluster sizes with no retraining, and it keeps latency well under the SLA ceiling everywhere, but it is not consistently better than classical control here, and its routing precision degrades noticeably at N=25, where the action space becomes 26-dimensional.

I also found that the anti-chattering term in the reward is not doing its job well in practice: PPO changes fleet size in over 70 percent of steps versus under 10 percent for the threshold baseline, so it ends up more reactive and “twitchy” than intended.
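The two metrics quoted above (cost efficiency and how often the fleet size changes) can be computed from a per-step trace along these lines. This is a sketch with hypothetical function names, and how the report aggregates over steps, for example per-step mean versus episode totals, is an assumption here.

```python
def cost_efficiency(users_served, active_containers):
    """Mean users served per active container, averaged over steps."""
    ratios = [u / n for u, n in zip(users_served, active_containers) if n > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0

def fleet_change_rate(active_containers):
    """Fraction of step transitions in which the container count changed."""
    if len(active_containers) < 2:
        return 0.0
    pairs = zip(active_containers, active_containers[1:])
    changes = sum(1 for before, after in pairs if before != after)
    return changes / (len(active_containers) - 1)
```

For a trace like `[5, 5, 6, 6, 5]` the change rate is 0.5, since two of the four transitions alter the fleet size.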

I wrote this up with the full derivations, per agent metric tables, and a section that’s specifically about where the learned policy falls short, rather than only the wins. Repo has the code, the report, and the result plots.

Repo: https://github.com/MartinFarres/LoadBalancerAutoScaler-DRL

Where I’d take this next

A lot of the real cluster numbers should be read with a grain of salt. The real training and eval runs were short (2k steps), mostly because of hardware constraints: I was running everything on a single machine and couldn’t afford longer runs there. So the sim-to-real comparison is probably hiding real differences between agents rather than showing they’re actually tied.

The change I’m most interested in for a future version is moving from a homogeneous cluster to a heterogeneous one: containers with different CPU/RAM specs instead of identical replicas. Right now the agent implicitly assumes every node is interchangeable, which is a pretty unrealistic assumption for real infra. It’s probably also where a learned policy could actually start to beat static rules, since a PID controller or threshold scaler has a much harder time reasoning about per-node capacity differences than a policy that observes them directly.

Happy to get pushback on the reward shaping or the evaluation methodology. This was very much a learning project and I’m sure there are things to improve, especially around the sim-to-real gap, given the real cluster eval window was short.

submitted by /u/TheGrilla_04