[Project Review] Attempting Multi-Warehouse VRP with Heterogeneous Fleet (REINFORCE). Stuck on the “Efficiency vs. Effectiveness” trade-off

Hi everyone,

I am an RL novice working on my first “real” project: a solver for the Multi-Warehouse Vehicle Routing Problem (MWVRP). My background is limited (I’ve essentially only read the DeepMDV paper and some standard VRP literature), so I am looking for a sanity check on my approach, as well as recommendations for papers or codebases that tackle similar constraints.

The Problem Setting:

I am modeling a supply chain with:

  • Multiple Depots & Heterogeneous Fleet (Vans, Medium Trucks, Heavy Trucks with different costs/capacities).
  • Multi-SKU Orders: Customers require specific items (weights/volumes), and vehicles must carry the correct inventory.
  • Graph: Real-world city topology (approx. 50-100 active nodes per episode).

My Current Approach:

  • Architecture: Attention-based Encoder-Decoder (similar to Kool et al. / DeepMDV).
    • Graph Encoder: Encodes customer/depot nodes.
    • Tour Decoder: Selects which vehicle acts next.
    • Node Decoder: Selects the next node for the selected vehicle.
  • Algorithm: REINFORCE with a Greedy Rollout Baseline (Student-Teacher).
  • Action Space: Discrete selection of (Vehicle, Node); a rough sketch of one decoding step follows.
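
For concreteness, here is a stripped-down sketch of what one decoding step looks like in my setup: pick a vehicle, then pick that vehicle's next node, and accumulate the log-probabilities for REINFORCE. The names (`PointerHead`, the mask tensors) are illustrative placeholders, not my actual code, and the encoder output, vehicle states, and feasibility masks are assumed to be computed elsewhere.

```python
# Illustrative sketch of one (vehicle, node) decoding step; names are placeholders.
import torch
import torch.nn as nn

class PointerHead(nn.Module):
    """Scores candidate embeddings against a query and samples one index."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, query, candidates, infeasible_mask):
        # query: (B, D), candidates: (B, N, D), infeasible_mask: (B, N) bool
        q = self.q_proj(query).unsqueeze(1)                  # (B, 1, D)
        k = self.k_proj(candidates)                          # (B, N, D)
        logits = (q * k).sum(-1) / candidates.size(-1) ** 0.5
        logits = logits.masked_fill(infeasible_mask, float("-inf"))
        dist = torch.distributions.Categorical(logits=logits)
        idx = dist.sample()
        return idx, dist.log_prob(idx)

B, V, N, D = 4, 3, 50, 128                      # batch, vehicles, nodes, embed dim
node_emb = torch.randn(B, N, D)                 # graph encoder output (customers + depots)
vehicle_emb = torch.randn(B, V, D)              # per-vehicle state embeddings
graph_ctx = node_emb.mean(dim=1)                # simple global context

tour_decoder = PointerHead(D)                   # selects which vehicle acts next
node_decoder = PointerHead(D)                   # selects the next node for that vehicle

vehicle_mask = torch.zeros(B, V, dtype=torch.bool)   # e.g. mask vehicles that are done
veh_idx, logp_veh = tour_decoder(graph_ctx, vehicle_emb, vehicle_mask)

chosen_vehicle = vehicle_emb[torch.arange(B), veh_idx]
node_mask = torch.zeros(B, N, dtype=torch.bool)      # e.g. mask visited / SKU-infeasible nodes
node_idx, logp_node = node_decoder(chosen_vehicle, node_emb, node_mask)

step_log_prob = logp_veh + logp_node            # summed over steps for the REINFORCE loss
```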

The Challenge: “Drunk but Productive” Agents

Initially, I used a sparse reward (pure negative distance cost + big bonus for clearing all orders). The agent failed to learn anything and just stayed at the depot to minimize cost.

I switched to Dense Rewards (a rough sketch of the reward function follows the list):

  • +1.0 per unit of weight delivered.
  • +10.0 bonus for fully completing an order.
  • -0.1 * distance penalty (scaled down so it doesn’t overpower the delivery reward).
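
Concretely, the per-step shaping looks roughly like this (the coefficients are the ones listed above; the argument names are placeholders for what my environment tracks):

```python
# Rough sketch of the dense reward described above; argument names are illustrative.
def dense_reward(delivered_weight, orders_completed, distance_traveled):
    reward = 1.0 * delivered_weight        # +1.0 per unit of weight delivered
    reward += 10.0 * orders_completed      # +10.0 bonus per fully completed order
    reward -= 0.1 * distance_traveled      # scaled-down distance penalty
    return reward
```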

The Result: The agent is now learning! It successfully clears ~90% of orders in validation. However, it is wildly inefficient. It behaves like it’s “driving drunk”, zigzagging across the map to grab rewards because the delivery reward outweighs the fuel cost. It has learned Effectiveness (deliver the goods) but not Efficiency (shortest path).

My Questions for the Community:

  1. Transitioning from Dense to Sparse: How do I wean the agent off these “training wheels” (dense rewards)? If I remove them now, will the policy collapse? Should I anneal the delivery reward to zero over time? (I've sketched the kind of schedule I mean after this list.)
  2. Handling SKU Matching: My model is somewhat “blind” to specific inventory. I handle constraints via masking: a customer is masked out for a vehicle that doesn’t carry the required SKU (roughly as sketched after this list). Is there a better way to embed “Inventory State” into the transformer without exploding the feature space?
  3. Algorithm: Is REINFORCE stable enough for this complexity, or is moving to PPO/A2C practically mandatory for Heterogeneous VRPs?
  4. Resources: Are there specific papers or repos that handle Multi-Depot + Inventory Constraints well? Most VRP papers seem to assume a single depot or infinite capacity.
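
Re: question 1, the kind of schedule I have in mind is just a multiplier on the dense delivery terms that decays to zero, leaving only the distance cost. Purely illustrative (I have not tried it, and the epoch counts are arbitrary):

```python
# Sketch of the annealing idea from question 1: full dense shaping during a
# warm-up period, then fade the delivery rewards out so only the distance
# penalty remains. All numbers are arbitrary placeholders.
def shaping_scale(epoch, warmup_epochs=50, anneal_epochs=150):
    if epoch < warmup_epochs:
        return 1.0
    return max(0.0, 1.0 - (epoch - warmup_epochs) / anneal_epochs)

def annealed_reward(delivered_weight, orders_completed, distance_traveled, epoch):
    shaped = 1.0 * delivered_weight + 10.0 * orders_completed
    return shaping_scale(epoch) * shaped - 0.1 * distance_traveled
```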
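
Re: question 2, for context, this is roughly the feasibility mask I build today (NumPy-style sketch; shapes and names are illustrative): a customer gets masked out for a vehicle that has no stock of any SKU that customer still needs.

```python
import numpy as np

# Roughly how I mask customers by SKU feasibility today (shapes/names illustrative).
# vehicle_stock:   (num_skus,)                remaining units on the selected vehicle
# customer_demand: (num_customers, num_skus)  outstanding demand per customer
def sku_infeasible_mask(vehicle_stock, customer_demand):
    # A customer is feasible if the vehicle can serve at least one demanded SKU.
    can_serve = (customer_demand > 0) & (vehicle_stock[None, :] > 0)
    return ~can_serve.any(axis=1)   # True = mask this customer for this vehicle
```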

Any advice, papers, or “you’re doing it wrong” feedback is welcome. Thanks!

submitted by /u/PolarIceBear_