Hi everyone,
We’re a group of students working on a reinforcement learning project, and honestly we’re pretty stuck. We’ve been trying to solve this for weeks and feel like we’re missing something fundamental. Below are the main problems we’re facing:
Problem Setup
- We train an agent to pick up items and deliver them to a target on a small grid.
- We use PPO because we assumed it was the most advanced option for this kind of task
- We split the reward 50/50 between pickup and delivery (rough sketch below, after this list)
- No matter what we try, we can’t get a final validation score that beats a simple greedy baseline.
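For context, this is roughly what our reward split looks like (a simplified sketch with made-up names, not our actual code, and assuming no other shaping terms):

```python
# Simplified sketch of our 50/50 reward split (names are illustrative).
def step_reward(picked_up_item: bool, delivered_item: bool) -> float:
    reward = 0.0
    if picked_up_item:
        reward += 0.5   # half the reward for reaching and grabbing an item
    if delivered_item:
        reward += 0.5   # other half for dropping it at the target
    return reward       # all other steps give 0, so the signal is quite sparse
```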
Observation / State Representation Issues
- We came up with several features (rough sketch of how we assemble them after this list), including:
  - Item time-to-live (TTL) adjusted by the agent’s distance to the item
  - Priority distance (agent → item → target)
  - Agent position plus distance to target
  - TTL minus the item-to-target distance
  - Agent load status map (the agent can only carry one item at a time)
- Despite this, learning does not improve, and the learned behavior is still clearly suboptimal at times.
- We also tried a CNN followed by a fully connected network (hybrid architecture, sketched below) → similarly bad results.
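To make the feature list above concrete, this is roughly how we assemble the observation vector (illustrative sketch; function and variable names are made up, and distances are Manhattan here, which may differ from our real code):

```python
import numpy as np

def build_observation(agent_pos, item_pos, target_pos, item_ttl, carrying):
    """Illustrative feature vector matching the list above (names are hypothetical)."""
    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    agent_to_item = manhattan(agent_pos, item_pos)
    item_to_target = manhattan(item_pos, target_pos)
    agent_to_target = manhattan(agent_pos, target_pos)

    return np.array([
        item_ttl - agent_to_item,        # TTL adjusted by agent distance
        agent_to_item + item_to_target,  # priority distance: agent -> item -> target
        agent_pos[0], agent_pos[1],      # agent position ...
        agent_to_target,                 # ... plus distance to target
        item_ttl - item_to_target,       # TTL minus item-to-target distance
        float(carrying),                 # load status (agent carries at most one item)
    ], dtype=np.float32)
```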
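And this is roughly the shape of the hybrid network we mentioned (a minimal PyTorch sketch, assuming the grid is fed in as a small image alongside the scalar features; the real architecture differs):

```python
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    """CNN over the grid + MLP over scalar features, joined in a shared head (illustrative)."""
    def __init__(self, grid_channels=3, grid_size=8, n_scalar=6, n_actions=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(grid_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        cnn_out = 32 * grid_size * grid_size
        self.mlp = nn.Sequential(nn.Linear(n_scalar, 64), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(cnn_out + 64, 128), nn.ReLU(),
            nn.Linear(128, n_actions),  # policy logits (value head omitted for brevity)
        )

    def forward(self, grid, scalars):
        return self.head(torch.cat([self.cnn(grid), self.mlp(scalars)], dim=1))
```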
Training Instability
- PPO appears extremely sensitive to hyperparameters.
- We use Bayesian optimization for tuning (roughly as in the sketch after this list), but:
  - Small changes lead to wildly different results
  - It’s hard to identify stable configurations
- Changing the number of episodes or the episode length (currently 200 steps) leads to huge fluctuations in validation performance.
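For reference, our tuning loop looks roughly like this (a sketch using Optuna’s default TPE sampler and stable-baselines3’s PPO; the env id "GridDelivery-v0" is a stand-in for our actual grid world):

```python
import gymnasium as gym
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

ENV_ID = "GridDelivery-v0"  # hypothetical env id standing in for our grid world

def objective(trial: optuna.Trial) -> float:
    env = gym.make(ENV_ID)
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        n_steps=trial.suggest_categorical("n_steps", [256, 512, 1024, 2048]),
        gamma=trial.suggest_float("gamma", 0.95, 0.999),
        clip_range=trial.suggest_float("clip_range", 0.1, 0.3),
        ent_coef=trial.suggest_float("ent_coef", 1e-4, 1e-1, log=True),
        verbose=0,
    )
    model.learn(total_timesteps=100_000)
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    return mean_reward

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=30)
```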
We don’t really understand whether the issue is PPO itself, reward shaping, or our observation space. We’d really appreciate any pointers, explanations, or solutions. Thanks a lot in advance 🙏