Hi everyone,
We’re a group of students working on a reinforcement learning project, and honestly we’re pretty stuck. We’ve been trying to solve this for weeks and feel like we’re missing something fundamental. Below are the main problems we’re facing:
Problem Setup
- We train an agent to pick up items and deliver them to a target on a small grid.
- We use PPO because we assumed it was the most advanced option for this kind of task
- We split the reward 50/50 between pickup and delivery (rough sketch below, after this list)
- No matter what we try, we can’t get a final validation score that beats a simple greedy baseline.
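For context, this is roughly what our reward split looks like (a simplified sketch with made-up names, not our actual code, and assuming no other shaping terms):

```python
# Simplified sketch of our 50/50 reward split (names are illustrative).
def step_reward(picked_up_item: bool, delivered_item: bool) -> float:
    reward = 0.0
    if picked_up_item:
        reward += 0.5   # half the reward for reaching and grabbing an item
    if delivered_item:
        reward += 0.5   # other half for dropping it at the target
    return reward       # all other steps give 0, so the signal is quite sparse
```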
Observation / State Representation Issues
- We came up with several features (rough sketch of how we assemble them after this list), including:
  - Item time-to-live (TTL) adjusted by the agent’s distance to the item
  - Priority distance (agent → item → target)
  - Agent position plus distance to target
  - TTL minus the item-to-target distance
  - Agent load status map (the agent can only carry one item at a time)
- Despite this, learning does not improve, and the learned behavior is still clearly suboptimal at times.
- We also tried a CNN followed by a fully connected network (hybrid architecture, sketched below) → similarly bad results.
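To make the feature list above concrete, this is roughly how we assemble the observation vector (illustrative sketch; function and variable names are made up, and distances are Manhattan here, which may differ from our real code):

```python
import numpy as np

def build_observation(agent_pos, item_pos, target_pos, item_ttl, carrying):
    """Illustrative feature vector matching the list above (names are hypothetical)."""
    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    agent_to_item = manhattan(agent_pos, item_pos)
    item_to_target = manhattan(item_pos, target_pos)
    agent_to_target = manhattan(agent_pos, target_pos)

    return np.array([
        item_ttl - agent_to_item,        # TTL adjusted by agent distance
        agent_to_item + item_to_target,  # priority distance: agent -> item -> target
        agent_pos[0], agent_pos[1],      # agent position ...
        agent_to_target,                 # ... plus distance to target
        item_ttl - item_to_target,       # TTL minus item-to-target distance
        float(carrying),                 # load status (agent carries at most one item)
    ], dtype=np.float32)
```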
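And this is roughly the shape of the hybrid network we mentioned (a minimal PyTorch sketch, assuming the grid is fed in as a small image alongside the scalar features; the real architecture differs):

```python
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    """CNN over the grid + MLP over scalar features, joined in a shared head (illustrative)."""
    def __init__(self, grid_channels=3, grid_size=8, n_scalar=6, n_actions=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(grid_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        cnn_out = 32 * grid_size * grid_size
        self.mlp = nn.Sequential(nn.Linear(n_scalar, 64), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(cnn_out + 64, 128), nn.ReLU(),
            nn.Linear(128, n_actions),  # policy logits (value head omitted for brevity)
        )

    def forward(self, grid, scalars):
        return self.head(torch.cat([self.cnn(grid), self.mlp(scalars)], dim=1))
```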
Training Instability
- PPO appears extremely sensitive to hyperparameters.
- We use Bayesian optimization for tuning (roughly as in the sketch after this list), but:
  - Small changes lead to wildly different results
  - It’s hard to identify stable configurations
- Changing the number of episodes or the episode length (currently 200 steps) leads to huge fluctuations in validation performance.
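For reference, our tuning loop looks roughly like this (a sketch using Optuna’s default TPE sampler and stable-baselines3’s PPO; the env id "GridDelivery-v0" is a stand-in for our actual grid world):

```python
import gymnasium as gym
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

ENV_ID = "GridDelivery-v0"  # hypothetical env id standing in for our grid world

def objective(trial: optuna.Trial) -> float:
    env = gym.make(ENV_ID)
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        n_steps=trial.suggest_categorical("n_steps", [256, 512, 1024, 2048]),
        gamma=trial.suggest_float("gamma", 0.95, 0.999),
        clip_range=trial.suggest_float("clip_range", 0.1, 0.3),
        ent_coef=trial.suggest_float("ent_coef", 1e-4, 1e-1, log=True),
        verbose=0,
    )
    model.learn(total_timesteps=100_000)
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    return mean_reward

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=30)
```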
We don’t really understand whether the issue is PPO itself, reward shaping, or our observation space. We’d really appreciate any pointers, explanations, or solutions. Thanks a lot in advance 🙏