[R] When Does Policy Conditioning Actually Help? A Controlled Study on Adaptation vs. Robustness
TL;DR: We ran a factorial study on policy conditioning (appending a “goal” signal to observations). It barely improves tracking precision, but it yields a 23x improvement in tail risk (CVaR). Crucially, we show that temporal correlation, not just the presence of the extra data, is the causal driver.
The Problem: The “Black Box” of Conditioning
In RL, we often append a task descriptor (goal, context vector, or latent) to the agent’s observation. We assume it helps the agent adapt. But why? Is it just the extra input dimension? The marginal statistics? Or the temporal alignment with the reward?
We disentangled this using a modified LunarLanderContinuous-v3 where the lander must track non-stationary target velocities while landing safely.
The Experimental Design
We trained PPO agents under four strictly controlled conditions to isolate the causal mechanism:
| Condition | Observation | What it controls for |
|---|---|---|
| Baseline | Standard Obs | The lower bound (reward-only learning). |
| Noise | Obs + i.i.d. Noise | Effect of increased input dimensionality. |
| Shuffled | Obs + Permuted Signal | Effect of the signal’s marginal distribution. |
| Conditioned | Obs + True Signal | The full information condition. |
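For concreteness, the four conditions can be wired up as a single observation-augmentation step. This is a minimal sketch, not code from the repo; the function name, argument names, and the idea of precomputing `permuted_signal` once per episode are illustrative assumptions:

```python
import numpy as np

def augment_observation(obs, signal, condition, rng, permuted_signal=None):
    """Append a conditioning channel to the observation, per study arm.

    condition: 'baseline', 'noise', 'shuffled', or 'conditioned'.
    permuted_signal: the same signal values replayed in a random temporal
    order (precomputed per episode), so marginals match but timing breaks.
    """
    if condition == "baseline":
        return obs                                   # standard observation only
    if condition == "noise":
        extra = rng.standard_normal(signal.shape)    # i.i.d. noise, same dim
    elif condition == "shuffled":
        if permuted_signal is None:
            raise ValueError("shuffled arm needs a precomputed permutation")
        extra = permuted_signal                      # same marginals, broken timing
    elif condition == "conditioned":
        extra = signal                               # true, time-aligned signal
    else:
        raise ValueError(f"unknown condition: {condition}")
    return np.concatenate([obs, extra])
```

The point of this layout is that all three augmented arms add the same number of dimensions, so any performance gap between them isolates *what* is in those dimensions rather than their mere presence.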
Key Findings
1. Robustness > Precision (The Headline Result)
Surprisingly, all agents showed similar mean tracking errors. They all prioritized “don’t crash” over “hit the target velocity.” However, the Conditioned agent was massively more robust:
- CVaR(10%) Improvement: The Conditioned agent achieved a 23x better tail-risk score than the Baseline (-1.7 vs -39.4).
- The Causal Driver: The Conditioned agent significantly outperformed the Shuffled agent. This shows that temporal correlation—the alignment of the signal with the current reward—is the operative factor, not just the presence of the data values.
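For readers unfamiliar with the metric: CVaR(10%) here is the mean return over the worst 10% of evaluation episodes, so it captures exactly the crash-heavy tail that mean reward averages away. A minimal sketch of that computation (the function name is ours, not from the repo):

```python
import numpy as np

def cvar(returns, alpha=0.10):
    """Conditional Value-at-Risk: mean return over the worst `alpha`
    fraction of episodes. Lower (more negative) means a heavier bad tail."""
    returns = np.sort(np.asarray(returns, dtype=float))  # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(returns))))       # size of the tail slice
    return returns[:k].mean()
```

On 1000 evaluation episodes, CVaR(10%) averages the 100 worst returns; two agents with identical means can differ enormously on this number, which is the headline effect above.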
2. The Linear Probe (The “Lie Detector”)
We ran a linear probe (Ridge regression) on the hidden layers to see if the agents “knew” the target internally:
- Conditioned Agent: R² = 1.000 (Perfect internal encoding).
- All Control Agents: R² < 0.18.
The conditioned agent knows exactly what the goal is, but it chooses to act conservatively to ensure a safe landing.
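A linear probe in this sense is just a ridge regression from recorded hidden activations to the target signal, scored by held-out R². Here is a self-contained sketch, with synthetic activations standing in for the real recorded ones (the helper name is ours, and a bias term is omitted for brevity):

```python
import numpy as np

def ridge_probe_r2(H_train, y_train, H_test, y_test, lam=1.0):
    """Fit ridge regression from hidden activations H to target signal y,
    then return held-out R^2 (how linearly decodable the goal is)."""
    d = H_train.shape[1]
    # Closed-form ridge solution: (H'H + lam*I) beta = H'y
    beta = np.linalg.solve(H_train.T @ H_train + lam * np.eye(d),
                           H_train.T @ y_train)
    pred = H_test @ beta
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in: activations whose target IS a linear readout,
# mimicking the Conditioned agent's near-perfect encoding.
rng = np.random.default_rng(0)
H = rng.standard_normal((5000, 64))        # (timesteps, hidden_dim)
y = H @ rng.standard_normal(64)            # linearly decodable target velocity
r2 = ridge_probe_r2(H[:4000], y[:4000], H[4000:], y[4000:])
```

If the probe recovers R² ≈ 1 while behavior ignores the target, the information is present in the representation but discounted by the policy—which is precisely the "knows but acts conservatively" reading above.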
3. Extra Dimensions are a “Tax”
The Noise agent performed slightly worse than the Baseline. Adding uninformative dimensions to your observation space isn’t neutral; it adds noise to gradient estimates without providing any compensating benefit.
Implications for RL Practitioners
- Evaluate Tail Risk: In this study, mean reward differences were modest (~6%), but CVaR differences were enormous (23x). Standard mean-based evaluation would have missed the primary benefit.
- Use Shuffled Controls: When claiming benefits from “contextual” policies, compare against a Shuffled control. If performance doesn’t drop, your agent isn’t actually using the context’s relationship to the reward structure.
- Probes Reveal Strategy: Probing hidden representations can distinguish between an agent that “doesn’t know the goal” and one that “knows but acts conservatively.”
Code & Full Study: https://github.com/Bhadra-Indranil/casual-policy-conditioning
I’m curious to hear from others working on non-stationary environments—have you seen similar ‘safety-first’ behavior where the agent ignores the goal signal to prioritize stability?
submitted by /u/IndividualBake4664