I built a custom Gymnasium environment to compare PPO against classical elevator dispatching – looking for feedback on my approach

Hey everyone, I’ve been working on an RL project where I trained a PPO agent to control 4 elevators in a 20-floor building simulation. The goal was to see if RL can beat a classical Destination Dispatching algorithm.

Results after 5M training steps on CPU:

Classical baseline: mean reward -0.67, avg wait 601 steps

PPO agent: mean reward +0.14, avg wait 93 steps (~84% reduction)

The hardest part was reward engineering – it took several iterations to get feedback dense enough for stable learning. Happy to share details on what failed.
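One standard trick for densifying sparse dispatching rewards, for anyone hitting the same wall, is potential-based shaping on aggregate wait time (Ng et al., 1999), which adds gradient signal every step without changing the optimal policy. This is a generic sketch, not necessarily what the repo does; all parameter names and scales are made up:

```python
def shaped_reward(prev_total_wait, total_wait, served, gamma=0.99,
                  serve_bonus=1.0, wait_scale=0.01):
    """Dense reward: a bonus per passenger served this step, plus a
    potential-based shaping term that rewards reducing aggregate wait.

    Potential phi(s) = -wait_scale * total_wait; the shaping term
    gamma * phi(s') - phi(s) preserves the optimal policy."""
    phi_prev = -wait_scale * prev_total_wait
    phi_curr = -wait_scale * total_wait
    return serve_bonus * served + gamma * phi_curr - phi_prev
```

The nice property is that you can tune `wait_scale` and `serve_bonus` for learning stability without worrying that the shaping itself biases the final policy.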

GitHub: https://github.com/jonas-is-coding/elevator-ai

Still working on realistic elevator kinematics (acceleration, door cycles). Would love feedback on whether my environment design and reward structure are sound – especially whether the comparison against the classic baseline is fair.
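On the kinematics point: a cheap intermediate step before full per-tick physics is a closed-form trapezoidal velocity profile for inter-floor travel times. The sketch below uses guessed constants (floor height, max speed, acceleration, door cycle), not measured elevator parameters:

```python
def travel_time(floors, floor_height=3.5, v_max=2.5, accel=1.0,
                door_cycle=8.0):
    """Rough travel-time model (seconds) with a trapezoidal velocity
    profile; all parameters are illustrative guesses."""
    d = floors * floor_height
    d_accel = v_max**2 / accel  # distance spent accelerating + decelerating
    if d <= d_accel:
        # Short hop: never reaches v_max (triangular profile)
        move = 2.0 * (d / accel) ** 0.5
    else:
        # Accelerate, cruise at v_max, decelerate
        move = 2.0 * v_max / accel + (d - d_accel) / v_max
    return move + door_cycle
```

Folding this into the env as a variable action duration (rather than one floor per step) keeps the MDP structure while making wait-time numbers comparable to real dispatching benchmarks.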

submitted by /u/Impossible_Case497