I built a custom Gymnasium environment to compare PPO against classical elevator dispatching – looking for feedback on my approach
Hey everyone, I’ve been working on an RL project where I trained a PPO agent to control 4 elevators in a 20-floor building simulation. The goal was to see if RL can beat a classical Destination Dispatching algorithm.
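For context, here's a minimal sketch of the kind of environment I mean, following the Gymnasium `reset()`/`step()` interface (written as a plain class so it runs standalone; the real env subclasses `gymnasium.Env` and declares `observation_space`/`action_space`). The floor/elevator counts match the post; everything else here is simplified illustration, not the actual repo code:

```python
import random

N_FLOORS, N_ELEVATORS = 20, 4

class ElevatorEnv:
    """Toy elevator env with Gymnasium-style reset()/step() signatures.

    Action: one of {-1 (down), 0 (open doors), +1 (up)} per elevator.
    Observation: elevator positions plus per-floor waiting counts.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        self.pos = [0] * N_ELEVATORS      # current floor of each elevator
        self.waiting = [0] * N_FLOORS     # passengers waiting per floor
        self.t = 0
        return self._obs(), {}

    def _obs(self):
        return tuple(self.pos) + tuple(self.waiting)

    def step(self, actions):
        for i, a in enumerate(actions):
            if a in (-1, 1):              # move, clamped to the building
                self.pos[i] = min(N_FLOORS - 1, max(0, self.pos[i] + a))
            else:                          # doors open: board this floor
                self.waiting[self.pos[i]] = 0
        if self.rng.random() < 0.3:        # stochastic passenger arrivals
            self.waiting[self.rng.randrange(N_FLOORS)] += 1
        self.t += 1
        reward = -sum(self.waiting)        # dense penalty for waiting
        terminated, truncated = False, self.t >= 500
        return self._obs(), reward, terminated, truncated, {}
```

Usage is the standard Gymnasium loop: `obs, info = env.reset()`, then `obs, r, term, trunc, info = env.step(action)` until `term or trunc`.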
Results after 5M training steps on CPU:
- Classic agent: mean reward -0.67, avg wait 601 steps
- PPO agent: mean reward +0.14, avg wait 93 steps (~84% reduction)

The hardest part was reward engineering – it took several iterations to make the feedback dense enough for stable learning. Happy to share details on what failed.
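On the dense-feedback point: a common shaping pattern is a small per-step penalty proportional to total accumulated waiting time, plus immediate bonuses on each pickup and drop-off, so the agent gets a learning signal every step instead of only at episode end. A sketch of that idea (the weights and names here are hypothetical, not the repo's actual values):

```python
def shaped_reward(waiting_steps, n_picked_up, n_delivered,
                  wait_penalty=0.01, pickup_bonus=0.5, delivery_bonus=1.0):
    """Dense reward: every step pays a penalty for cumulative waiting,
    and pickups/deliveries pay out immediately, not at episode end."""
    return (-wait_penalty * sum(waiting_steps)
            + pickup_bonus * n_picked_up
            + delivery_bonus * n_delivered)

# Two passengers have waited 30 and 10 steps; one was just delivered:
shaped_reward([30, 10], n_picked_up=0, n_delivered=1)  # -> 0.6
```

The trade-off is the usual one: larger per-step penalties speed up credit assignment but can push the agent toward myopic pickups over globally efficient routing.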
GitHub: https://github.com/jonas-is-coding/elevator-ai
Still working on realistic elevator kinematics (acceleration, door cycles). Would love feedback on whether my environment design and reward structure are sound – especially whether the comparison against the classic baseline is fair.
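On the fairness question: the usual sanity check is that the classical controller sees the same observations and acts at the same decision frequency as the agent. The post's baseline is Destination Dispatching; for readers unfamiliar with classical controllers, a simpler nearest-car assignment rule (shown here only as an illustration, not necessarily the repo's variant) looks roughly like this:

```python
def nearest_car_dispatch(elevator_floors, hall_calls):
    """Greedy nearest-car rule: assign each hall call to the closest
    still-unassigned elevator.

    elevator_floors: current floor of each elevator
    hall_calls: floors with a pending call button press
    Returns {elevator_index: target_floor}; unassigned elevators idle.
    """
    assignment = {}
    free = set(range(len(elevator_floors)))
    for call in sorted(hall_calls):
        if not free:
            break
        best = min(free, key=lambda i: abs(elevator_floors[i] - call))
        assignment[best] = call
        free.remove(best)
    return assignment

nearest_car_dispatch([0, 5, 10, 19], [7, 18])  # -> {1: 7, 3: 18}
```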
submitted by /u/Impossible_Case497