Observation Space Design For Long Horizon Task

digitado ⋅ June 4, 2026

I’ve been working on a web-based RL Playground using Three.js on the frontend and Gymnasium + PyBullet + PPO (Stable-Baselines3) on the backend.

So far I have successfully trained:

• Navigation to a target
• Coin finding
• Coin collection

The latest model can navigate toward a coin and perform the collect action when within range.

For my final-year project (FYP), I’m not expected to deliver many separate agents, but rather a single agent that can execute a longer sequence of interactions (5+). The demo date is 17th June.

Proposed Long-Horizon Task

I’m considering a task chain like:

Find Coin
↓
Collect Coin
↓
Find Deposit
↓
Deposit Coin
↓
Open Gate
↓
Destroy Obstacle
↓
Find Target
↓
Interact With Target

The idea is to train individual abilities through curriculum learning and then combine them into a single policy.

Observation Space Design

Initially I was giving each capability its own Finder observations:

Coin: [dist, side, depth, in_radius]
Deposit: [dist, side, depth, in_radius]
Target: [dist, side, depth, in_radius]
Destroyable: [dist, side, depth, in_radius]
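For illustration, one helper could produce the four-value Finder vector for any object. This is a minimal sketch, not my actual code: `finder_obs`, `interact_radius`, and `max_range` are hypothetical names, and it assumes "side" and "depth" mean the unit direction to the object expressed in the agent's frame (depth positive when the object is ahead, side positive when it is to the left).

```python
import math


def finder_obs(agent_xy, agent_yaw, obj_xy, interact_radius=1.0, max_range=20.0):
    """Return [dist, side, depth, in_radius] for one object (assumed semantics)."""
    dx = obj_xy[0] - agent_xy[0]
    dy = obj_xy[1] - agent_xy[1]
    dist = math.hypot(dx, dy)

    # Rotate the world-frame offset into the agent frame.
    # depth > 0: object is ahead; side > 0: object is to the agent's left.
    depth = math.cos(agent_yaw) * dx + math.sin(agent_yaw) * dy
    side = -math.sin(agent_yaw) * dx + math.cos(agent_yaw) * dy

    # Normalise direction to unit length and distance to [0, 1].
    scale = max(dist, 1e-8)
    return [
        min(dist / max_range, 1.0),
        side / scale,
        depth / scale,
        1.0 if dist <= interact_radius else 0.0,
    ]


# Agent at the origin facing +x, coin two units straight ahead.
coin_obs = finder_obs((0.0, 0.0), 0.0, (2.0, 0.0))
```

Calling the same helper once per object type gives the four vectors listed above, 16 values in total.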

This started becoming repetitive.

Instead I’m considering introducing a behavior state machine that determines the current objective.

For example:

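    # Test deposited before holding: assuming holding drops back to 0
    # after a deposit, testing holding first would reselect COIN.
    if deposited == 0:
        current_goal = COIN if holding == 0 else DEPOSIT
    elif gate_open == 0:
        current_goal = GATE
    elif destroyable_destroyed == 0:
        current_goal = DESTROYABLE
    else:
        current_goal = TARGET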

The policy would then only receive observations for the active goal.
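As a rough sketch of that selection step, the state machine can be written as a pure function that returns the active goal, which is then used to pick a single Finder vector. The `Goal` enum, `select_goal`, `active_finder`, and the sample values are hypothetical. The sketch tests the deposit flag before `holding`, on the assumption that `holding` returns to 0 once the coin has been deposited.

```python
from enum import IntEnum


class Goal(IntEnum):
    COIN = 0
    DEPOSIT = 1
    GATE = 2
    DESTROYABLE = 3
    TARGET = 4


def select_goal(holding, deposited, gate_open, destroyable_destroyed):
    """Return the first unfinished objective in the task chain."""
    if not deposited:
        # Assumption: holding resets to 0 after a deposit, so the
        # deposit flag has to be tested before holding.
        return Goal.DEPOSIT if holding else Goal.COIN
    if not gate_open:
        return Goal.GATE
    if not destroyable_destroyed:
        return Goal.DESTROYABLE
    return Goal.TARGET


def active_finder(goal, finders):
    """Pick the finder vector for the active goal; the others stay hidden."""
    return list(finders[goal])


# Hypothetical per-object finder values: [dist, side, depth, in_radius]
finders = {
    Goal.COIN: [0.40, -0.2, 0.9, 0.0],
    Goal.DEPOSIT: [0.75, 0.6, 0.8, 0.0],
    Goal.GATE: [0.90, 0.0, 1.0, 0.0],
    Goal.DESTROYABLE: [0.95, 0.1, 0.9, 0.0],
    Goal.TARGET: [1.00, -0.5, 0.8, 0.0],
}

goal = select_goal(holding=1, deposited=0, gate_open=0, destroyable_destroyed=0)
obs = active_finder(goal, finders)  # deposit finder only
```

Keeping the selector outside the policy means the phase logic is hand-written and cannot be learned wrongly, but the policy also never sees objects that belong to later phases.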

Proposed Observation Space

# Active Goal Finder
goal_distance
goal_side_signal
goal_depth_signal
goal_in_radius

# Progress State
holding
items_collected
item_deposited
gate_open
destroyable_destroyed

# Goal Indicator
goal_is_coin
goal_is_deposit
goal_is_gate
goal_is_destroyable
goal_is_target

# Navigation
obs_front
obs_left
obs_right
is_blocked

Total is roughly 18-20 dimensions.
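To make the layout concrete, here is a minimal sketch of how the fields listed above could be flattened into one vector. `build_observation`, the key tuples, and the sample values are hypothetical; with exactly these fields the vector has 18 entries, and any extra field would push it toward 20. In the real environment the result would presumably back a `gymnasium.spaces.Box` of that shape, which this sketch leaves out to stay dependency-free.

```python
GOALS = ("coin", "deposit", "gate", "destroyable", "target")
FINDER_KEYS = ("goal_distance", "goal_side_signal",
               "goal_depth_signal", "goal_in_radius")
PROGRESS_KEYS = ("holding", "items_collected", "item_deposited",
                 "gate_open", "destroyable_destroyed")
NAV_KEYS = ("obs_front", "obs_left", "obs_right", "is_blocked")

OBS_DIM = len(FINDER_KEYS) + len(PROGRESS_KEYS) + len(GOALS) + len(NAV_KEYS)


def build_observation(goal, state):
    """Flatten one environment state into a fixed-order list of floats."""
    vec = [float(state[k]) for k in FINDER_KEYS]
    vec += [float(state[k]) for k in PROGRESS_KEYS]
    # One-hot goal indicator: goal_is_coin, goal_is_deposit, ...
    vec += [1.0 if g == goal else 0.0 for g in GOALS]
    vec += [float(state[k]) for k in NAV_KEYS]
    return vec


# Hypothetical state while carrying a coin toward the deposit.
state = {
    "goal_distance": 0.3, "goal_side_signal": -0.1,
    "goal_depth_signal": 0.99, "goal_in_radius": 0,
    "holding": 1, "items_collected": 1, "item_deposited": 0,
    "gate_open": 0, "destroyable_destroyed": 0,
    "obs_front": 0.8, "obs_left": 1.0, "obs_right": 0.4, "is_blocked": 0,
}
obs = build_observation("deposit", state)
```

A fixed field order matters here: a policy trained on one ordering will not transfer to another, so the key tuples are worth keeping in one place across curriculum stages.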

The idea is that the policy always sees:

• Where is my current objective?
• Am I close enough to interact?
• What phase of the task am I currently in?

instead of receiving separate direction vectors for every object in the world.

Curriculum Plan

Current thought process:

Stage 1: Find Coin
Stage 2: Collect Coin
Stage 3: Find Deposit
Stage 4: Deposit Coin
Stage 5: Open Gate
Stage 6: Destroy Obstacle
Stage 7: Find Target
Stage 8: Combine everything into a single policy

Each stage would start with fixed spawns and gradually move toward randomized spawns.
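One possible way to implement the fixed-to-randomized transition is a single randomization level in [0, 1] that blends the fixed spawn toward a uniformly random one and only increases once the agent succeeds often enough. This is a sketch under assumptions: `spawn_position`, `update_randomization`, the square arena, the 0.8 success threshold, and the 0.1 step are all hypothetical choices, not something I have tuned.

```python
import random


def spawn_position(fixed_xy, randomization, arena_half_size=5.0, rng=random):
    """Blend a fixed spawn toward a uniform random spawn.

    randomization = 0.0 returns the fixed spawn, 1.0 a fully random one.
    """
    r = min(max(randomization, 0.0), 1.0)
    rand_x = rng.uniform(-arena_half_size, arena_half_size)
    rand_y = rng.uniform(-arena_half_size, arena_half_size)
    return (
        (1.0 - r) * fixed_xy[0] + r * rand_x,
        (1.0 - r) * fixed_xy[1] + r * rand_y,
    )


def update_randomization(current, success_rate, threshold=0.8, step=0.1):
    """Raise the randomization level once recent success clears the threshold."""
    if success_rate >= threshold:
        return min(current + step, 1.0)
    return current


# Start of a stage: fixed spawn; later: widen as the agent improves.
level = 0.0
start = spawn_position((1.0, 2.0), level)
level = update_randomization(level, success_rate=0.9)
```

The success rate would come from a rolling window of recent episodes in the training loop, which is not shown here.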

Main Question

For those who have trained PPO agents on long-horizon tasks:

1. Does the active-goal observation design seem reasonable?
2. Would you expose only the current objective or all object directions simultaneously?
3. Any obvious pitfalls before I commit to this curriculum approach?

submitted by /u/Public-Journalist820