Observation Space Design For Long Horizon Task
I’ve been working on a web-based RL Playground using Three.js on the frontend and Gymnasium + PyBullet + PPO (Stable-Baselines3) on the backend. So far I have successfully trained:
The latest model can navigate toward a coin and perform the collect action when within range. For my FYP, the expectation is not necessarily many separate agents, but rather an agent capable of executing a longer sequence of interactions (5+). Demo date is 17th June.

Proposed Long-Horizon Task

I’m considering a task chain like:

Find Coin → Collect Coin → Find Deposit → Deposit Coin → Open Gate → Destroy Obstacle → Find Target → Interact With Target

The idea is to train individual abilities through curriculum learning and then combine them into a single policy.

Observation Space Design

Initially I was giving each capability its own Finder observations:

    Coin:        [dist, side, depth, in_radius]
    Deposit:     [dist, side, depth, in_radius]
    Target:      [dist, side, depth, in_radius]
    Destroyable: [dist, side, depth, in_radius]

This started becoming repetitive. Instead I’m considering introducing a behavior state machine that determines the current objective. For example:

    if holding == 0:
        current_goal = COIN
    elif deposited == 0:
        current_goal = DEPOSIT
    elif gate_open == 0:
        current_goal = GATE
    elif destroyable_destroyed == 0:
        current_goal = DESTROYABLE
    else:
        current_goal = TARGET

The policy would then only receive observations for the active goal.

Proposed Observation Space

    # Active Goal Finder
    goal_distance
    goal_side_signal
    goal_depth_signal
    goal_in_radius

    # Progress State
    holding
    items_collected
    item_deposited
    gate_open
    destroyable_destroyed

    # Goal Indicator
    goal_is_coin
    goal_is_deposit
    goal_is_gate
    goal_is_destroyable
    goal_is_target

    # Navigation
    obs_front
    obs_left
    obs_right
    is_blocked

Total is roughly 18-20 dimensions.

The idea is that the policy always sees:

Where is my current objective?
Am I close enough to interact?
What phase of the task am I currently in?

instead of receiving separate direction vectors for every object in the world.
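A minimal sketch of how the goal selector and the flat observation described above could be assembled, in plain Python. The names `Goal`, `Progress`, `select_goal`, and `build_observation` are hypothetical, and the finder and navigation readings are assumed to be computed elsewhere in the environment. One deliberate difference from the pseudocode: the sketch checks `item_deposited` before `holding`, on the assumption that `holding` resets to 0 after a deposit. If it does reset, checking `holding == 0` first would send the goal back to the coin.

```python
from dataclasses import dataclass
from enum import IntEnum


class Goal(IntEnum):
    COIN = 0
    DEPOSIT = 1
    GATE = 2
    DESTROYABLE = 3
    TARGET = 4


@dataclass
class Progress:
    holding: int = 0
    items_collected: int = 0
    item_deposited: int = 0
    gate_open: int = 0
    destroyable_destroyed: int = 0


def select_goal(p: Progress) -> Goal:
    """Pick the active objective from the progress flags."""
    if p.item_deposited == 0:
        # Before the deposit, the goal depends on whether a coin is held.
        return Goal.COIN if p.holding == 0 else Goal.DEPOSIT
    if p.gate_open == 0:
        return Goal.GATE
    if p.destroyable_destroyed == 0:
        return Goal.DESTROYABLE
    return Goal.TARGET


def build_observation(p: Progress, finders: dict, nav: tuple) -> list:
    """Return the flat 18-dim observation for the active goal.

    finders maps each Goal to (dist, side, depth, in_radius);
    nav is (obs_front, obs_left, obs_right, is_blocked).
    """
    goal = select_goal(p)
    goal_one_hot = [1.0 if g == goal else 0.0 for g in Goal]
    progress = [
        p.holding,
        p.items_collected,
        p.item_deposited,
        p.gate_open,
        p.destroyable_destroyed,
    ]
    values = list(finders[goal]) + progress + goal_one_hot + list(nav)
    return [float(v) for v in values]
```

In a Gymnasium environment the returned list would typically be converted with `np.asarray(obs, dtype=np.float32)` to match an 18-dimensional `Box` observation space. Keeping the layout fixed across all curriculum stages means the same policy network input can be reused when the stages are combined.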
Curriculum Plan

Current thought process:

Stage 1: Find Coin
Stage 2: Collect Coin
Stage 3: Find Deposit
Stage 4: Deposit Coin
Stage 5: Open Gate
Stage 6: Destroy Obstacle
Stage 7: Find Target
Stage 8: Combine everything into a single policy

Each stage would start with fixed spawns and gradually move toward randomized spawns.

Main Question

For those who have trained PPO agents on long-horizon tasks:
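For context on the curriculum plan, one possible way to express the "fixed spawns, then gradually randomized spawns" progression is a single randomness value per stage that blends between the two. This is only a sketch: `spawn_position`, `next_randomness`, the square arena, the 0.8 success threshold, and the 0.1 step are all hypothetical choices, not something already implemented.

```python
import random


def spawn_position(fixed_xy, arena_half_extent, randomness, rng):
    """Blend between a fixed spawn and a uniformly random spawn.

    randomness = 0.0 always returns fixed_xy; randomness = 1.0 samples
    roughly uniformly over a square arena centred on the origin.
    """
    r = min(max(randomness, 0.0), 1.0)
    fx, fy = fixed_xy
    rx = rng.uniform(-arena_half_extent, arena_half_extent)
    ry = rng.uniform(-arena_half_extent, arena_half_extent)
    # Linear interpolation from the fixed point toward the random sample.
    return (fx + r * (rx - fx), fy + r * (ry - fy))


def next_randomness(current, success_rate, threshold=0.8, step=0.1):
    """Increase spawn randomness only once the agent is succeeding reliably."""
    if success_rate >= threshold:
        return min(1.0, current + step)
    return current


# Example: a stage starts fully fixed and opens up as success improves.
rng = random.Random(0)
level = 0.0
for recent_success in (0.5, 0.85, 0.9):
    level = next_randomness(level, recent_success)
    coin_xy = spawn_position((2.0, 1.0), 5.0, level, rng)
```

Gating the increase on a recent success rate, rather than on a fixed number of timesteps, is one way to avoid widening the spawn distribution before the current stage has actually been learned.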
submitted by /u/Public-Journalist820