Observation Space Design For Long Horizon Task

I’ve been working on a web-based RL Playground using Three.js on the frontend and Gymnasium + PyBullet + PPO (Stable-Baselines3) on the backend.

So far I have successfully trained:

• Navigation to a target
• Coin finding
• Coin collection

The latest model can navigate toward a coin and perform the collect action when within range.

For my final-year project (FYP), the expectation is not many separate agents but a single agent capable of executing a longer sequence of interactions (5+). The demo date is 17th June.

Proposed Long-Horizon Task

I’m considering a task chain like:

1. Find Coin
2. Collect Coin
3. Find Deposit
4. Deposit Coin
5. Open Gate
6. Destroy Obstacle
7. Find Target
8. Interact With Target

The idea is to train individual abilities through curriculum learning and then combine them into a single policy.

Observation Space Design

Initially I was giving each capability its own Finder observations:

Coin: [dist, side, depth, in_radius]
Deposit: [dist, side, depth, in_radius]
Target: [dist, side, depth, in_radius]
Destroyable: [dist, side, depth, in_radius]

This started becoming repetitive.
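Since all four blocks share the same layout, one shared helper could compute them. The sketch below is only an illustration: the function name, the 2D pose arguments, `interact_radius` and `max_range` are hypothetical, and it assumes `depth` means the forward component and `side` the lateral component of the direction to the object in the agent's frame, which may not match the actual definitions.

```python
import math

def finder_features(agent_xy, agent_yaw, object_xy,
                    interact_radius=1.0, max_range=20.0):
    """Return [dist, side, depth, in_radius] for one object (assumed semantics)."""
    dx = object_xy[0] - agent_xy[0]
    dy = object_xy[1] - agent_xy[1]
    dist = math.hypot(dx, dy)

    # Rotate the world-frame offset into the agent frame.
    # depth: how far ahead the object is; side: positive = left, negative = right.
    cos_y, sin_y = math.cos(agent_yaw), math.sin(agent_yaw)
    depth = dx * cos_y + dy * sin_y
    side = -dx * sin_y + dy * cos_y

    # Divide by distance so side/depth form a unit direction in [-1, 1].
    scale = max(dist, 1e-8)
    return [
        min(dist / max_range, 1.0),              # normalised distance in [0, 1]
        side / scale,
        depth / scale,
        1.0 if dist <= interact_radius else 0.0, # close enough to interact
    ]
```

The same function can then be called once per object type, or once for whichever object is currently relevant.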

Instead I’m considering introducing a behavior state machine that determines the current objective.

For example:

    if holding == 0:
        current_goal = COIN
    elif deposited == 0:
        current_goal = DEPOSIT
    elif gate_open == 0:
        current_goal = GATE
    elif destroyable_destroyed == 0:
        current_goal = DESTROYABLE
    else:
        current_goal = TARGET

The policy would then only receive observations for the active goal.
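A runnable version of that selector might look like the sketch below. The `Goal` enum, `select_goal` and `goal_one_hot` are hypothetical names. One assumption is worth flagging: if `holding` resets to 0 once the coin is deposited, checking `holding == 0` first would send the agent back to COIN after the deposit, so this variant tests the `deposited` flag before looking at `holding`.

```python
from enum import IntEnum

class Goal(IntEnum):
    COIN = 0
    DEPOSIT = 1
    GATE = 2
    DESTROYABLE = 3
    TARGET = 4

def select_goal(holding, deposited, gate_open, destroyable_destroyed):
    """Pick the active goal from the progress flags (each 0 or 1)."""
    if not deposited:
        # Before the deposit: fetch a coin first, then carry it to the deposit.
        return Goal.DEPOSIT if holding else Goal.COIN
    if not gate_open:
        return Goal.GATE
    if not destroyable_destroyed:
        return Goal.DESTROYABLE
    return Goal.TARGET

def goal_one_hot(goal):
    """Encode the active goal as the five goal_is_* indicator bits."""
    return [1.0 if goal == g else 0.0 for g in Goal]
```

For example, `select_goal(holding=1, deposited=0, gate_open=0, destroyable_destroyed=0)` returns `Goal.DEPOSIT`, and `goal_one_hot(Goal.DEPOSIT)` gives the indicator bits for that phase.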

Proposed Observation Space

# Active Goal Finder
goal_distance
goal_side_signal
goal_depth_signal
goal_in_radius

# Progress State
holding
items_collected
item_deposited
gate_open
destroyable_destroyed

# Goal Indicator
goal_is_coin
goal_is_deposit
goal_is_gate
goal_is_destroyable
goal_is_target

# Navigation
obs_front
obs_left
obs_right
is_blocked

The fields listed above come to 18 dimensions, so the total is roughly 18-20.

The idea is that the policy always sees:

• Where is my current objective?
• Am I close enough to interact?
• What phase of the task am I currently in?

instead of receiving separate direction vectors for every object in the world.
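To make the layout concrete, here is a minimal sketch of how the proposed vector could be assembled each step. `WorldState`, `build_observation`, `GOALS` and `max_items` are hypothetical names, and the sketch assumes the environment already tracks a finder vector for every object and that all values are scaled to a fixed range.

```python
from dataclasses import dataclass
from typing import Dict, List

GOALS = ("coin", "deposit", "gate", "destroyable", "target")

@dataclass
class WorldState:
    finder: Dict[str, List[float]]  # goal name -> [dist, side, depth, in_radius]
    holding: int
    items_collected: int
    item_deposited: int
    gate_open: int
    destroyable_destroyed: int
    obs_front: float
    obs_left: float
    obs_right: float
    is_blocked: int

def build_observation(state: WorldState, active_goal: str,
                      max_items: int = 1) -> List[float]:
    """Concatenate the four groups into one flat 18-dimensional vector."""
    if active_goal not in GOALS:
        raise ValueError(f"unknown goal: {active_goal}")
    # Only the active goal's finder values are exposed to the policy.
    goal_finder = list(state.finder[active_goal])
    progress = [
        float(state.holding),
        min(state.items_collected / max_items, 1.0),  # keep the count bounded
        float(state.item_deposited),
        float(state.gate_open),
        float(state.destroyable_destroyed),
    ]
    indicator = [1.0 if name == active_goal else 0.0 for name in GOALS]
    navigation = [state.obs_front, state.obs_left, state.obs_right,
                  float(state.is_blocked)]
    return goal_finder + progress + indicator + navigation
```

The returned list has 4 + 5 + 5 + 4 = 18 entries. In a Gymnasium environment it would typically be converted to a float32 array matching a `Box` observation space of shape `(18,)`.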

Curriculum Plan

Current thought process:

Stage 1: Find Coin
Stage 2: Collect Coin
Stage 3: Find Deposit
Stage 4: Deposit Coin
Stage 5: Open Gate
Stage 6: Destroy Obstacle
Stage 7: Find Target
Stage 8: Combine everything into a single policy

Each stage would start with fixed spawns and gradually move toward randomized spawns.
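One possible way to move from fixed to randomized spawns is a single `randomness` knob that blends the fixed position with a uniformly sampled one. This is only a sketch: `sample_spawn`, `next_randomness`, the square arena, and the 0.8 success threshold are all assumptions for illustration, not part of the current setup.

```python
import random

def sample_spawn(fixed_xy, arena_half_size, randomness, rng=random):
    """Blend a fixed spawn with a uniform one.

    randomness = 0.0 always returns fixed_xy; randomness = 1.0 samples
    uniformly over the square [-arena_half_size, arena_half_size]^2.
    """
    randomness = min(max(randomness, 0.0), 1.0)
    rx = rng.uniform(-arena_half_size, arena_half_size)
    ry = rng.uniform(-arena_half_size, arena_half_size)
    x = (1.0 - randomness) * fixed_xy[0] + randomness * rx
    y = (1.0 - randomness) * fixed_xy[1] + randomness * ry
    return (x, y)

def next_randomness(success_rate, current, step=0.1, threshold=0.8):
    """Widen the spawn distribution only once the agent succeeds often enough."""
    if success_rate >= threshold:
        return min(current + step, 1.0)
    return current
```

Calling `next_randomness` after each evaluation window and passing the result to `sample_spawn` on every reset would widen the spawn region only as the success rate allows.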

Main Question

For those who have trained PPO agents on long-horizon tasks:

1. Does the active-goal observation design seem reasonable?
2. Would you expose only the current objective or all object directions simultaneously?
3. Any obvious pitfalls before I commit to this curriculum approach?

submitted by /u/Public-Journalist820