[Update] Continuous RL via DP in CUDA: Solving the Underactuated Double Pendulum & Hybrid 6D Solvers

Hey r/reinforcementlearning,

Quick follow-up to my project on Continuous RL via Dynamic Programming in CUDA. In my previous tests with the Overhead Crane and Double CartPole, the policy often got stuck in “partial” solutions (e.g. Link 1 upright + Link 2 free-spinning) or periodic limit cycles.

I just shipped a fix. This remains pure DP: no LQR, no continuous policy gradients. Highlights below.

1. Underactuated Double Pendulum (4D sandbox)

I added a new runner: two coupled links on a fixed pivot. Torque is applied only at the base joint (Link 2 moves via inertial coupling).

  • State: [θ₁, ω₁, θ₂, ω₂]
  • Performance: with bins=50, the policy reaches cos(θ) = 0.999 for both links and |ω| < 0.2 rad/s. Genuine stable swing-up in ~2 seconds.
  • Why it matters: 4D trials are 100–1000x faster than the 6D version. With bins=15, a trial takes ~5 seconds, allowing a tight scientific loop when iterating on reward shaping (see the grid sketch below).
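To make the 4D setup concrete, here’s a minimal sketch of the state grid as I’d set it up in NumPy. The velocity bounds and variable names are my assumptions, not the repo’s actual API; the point is how the table size scales with bins:

```python
import numpy as np

bins = 50
# One axis per state dimension; velocity bounds of ±8 rad/s are assumed.
theta_grid = np.linspace(-np.pi, np.pi, bins)   # θ₁ and θ₂
omega_grid = np.linspace(-8.0, 8.0, bins)       # ω₁ and ω₂

# Value table over [θ₁, ω₁, θ₂, ω₂]: bins**4 = 6.25M states at bins=50.
# The 6D cart-pole variant scales as bins**6, hence the huge per-trial gap.
V = np.zeros((bins,) * 4, dtype=np.float32)
```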

2. What finally cracked the reward shaping

The key insight: DP with discrete actions produces genuine limit cycles that are fixed points of the optimization. You can’t just brute-force them away with bigger penalties; you have to design rewards that make them strictly worse than the true optimum.

My current reward function combines a survival baseline with six shaping terms:

```
r = baseline                                 # +0.5: survival ≥ termination
  + 0.5 * (cos θ₁ + cos θ₂)                  # smooth gradient toward upright
  + 4.0 * gate**2                            # quadratic in gate: max(0, c1) * max(0, c2)
  + 5.0 * gate**4 * (1 - ω**2/2.5)**2        # smooth "stillness bowl"
  - 1.0 * E_err                              # asymmetric energy penalty (1.5x under)
  - 0.5 * (c1 - c2)**2                       # anti-alignment (kills "I-shape" attractor)
  - 0.1 * gate * (ω1**2 + ω2**2)             # velocity damping ONLY when upright
```
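For concreteness, here’s that reward as a runnable Python sketch. I’m reading c1/c2 as the link cosines, gate as max(0, c1) * max(0, c2), ω² in the stillness bowl as the summed squared velocity, and E_err as the energy error with the 1.5x under-target weight; the function signature is mine, not the repo’s:

```python
import math

def shaped_reward(theta1, omega1, theta2, omega2, energy, energy_target):
    c1, c2 = math.cos(theta1), math.cos(theta2)
    gate = max(0.0, c1) * max(0.0, c2)      # > 0 only when BOTH links are up
    omega_sq = omega1**2 + omega2**2

    # Asymmetric energy error: penalize 1.5x harder below the target energy.
    dE = energy - energy_target
    E_err = -1.5 * dE if dE < 0 else dE

    r = 0.5                                          # survival baseline
    r += 0.5 * (c1 + c2)                             # gradient toward upright
    r += 4.0 * gate**2                               # quadratic upright gate
    r += 5.0 * gate**4 * (1.0 - omega_sq / 2.5)**2   # smooth stillness bowl
    r -= 1.0 * E_err                                 # asymmetric energy penalty
    r -= 0.5 * (c1 - c2)**2                          # anti-alignment term
    r -= 0.1 * gate * omega_sq                       # damping only when upright
    return r
```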

Failure modes addressed:

  • Anti-alignment penalty. Prevents the “I-shape” where Link 1 hangs down and Link 2 inverts.
  • Smooth stillness bowl. Replaced hard “cliffs” with a smooth gradient to prevent the policy from oscillating on the boundary.
  • Asymmetric energy. Penalizing 1.5x harder when energy is below target was the single biggest unlock for getting past the “swinging but not reaching” plateau.

3. Hybrid solver for the 6D Double CartPole

To solve the 6D variant (which is notoriously difficult), I implemented a two-stage controller logic within the DP framework:

Phase      Policy                              When active
Swing-up   Full ±π range, coarse grid          Far from upright
Balance    Narrow ±0.3 rad range, fine grid    Near upright

Hysteresis on the switch (enter balance at |θ| < 0.28, exit at |θ| > 0.35; sketched below) prevents rapid toggling. This gives a level of precision that a single global policy can’t reach.
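A minimal sketch of that switch, assuming |θ| means the larger of the two link angles (the phase constants and function name are mine):

```python
SWING_UP, BALANCE = 0, 1

def update_phase(phase: int, theta_max: float) -> int:
    # Enter balance below 0.28 rad; fall back to swing-up above 0.35 rad.
    # The dead band between the two thresholds prevents rapid toggling.
    if phase == SWING_UP and theta_max < 0.28:
        return BALANCE
    if phase == BALANCE and theta_max > 0.35:
        return SWING_UP
    return phase
```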

4. Autoresearch harness (the meta-tool)

This shaping wasn’t found by hand. I used an LLM agent to iterate over 30+ trials (edit coefficients → train → evaluate → score), inspired by Karpathy’s autoresearch.

The repo now includes:

  • runners/eval_metric.py — external read-only score function.
  • runners/trial_runner.sh — one-command pipeline (clean → train → eval).
  • trial_log.md — append-only logbook of the agent’s progress.

Sonnet 3.7/4.6 ran the loop overnight for about $1–2 in API tokens to find the optimal coefficients.
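Here’s roughly what one iteration of that loop looks like, as a hedged sketch: it assumes trial_runner.sh does the full clean → train → eval pass and that eval_metric.py prints a single scalar score, which is my reading of the harness, not guaranteed to match the repo exactly:

```python
import subprocess

def run_trial() -> float:
    """One iteration: the agent has already edited the reward coefficients
    in the runner source; we train, then read back the external score."""
    subprocess.run(["bash", "runners/trial_runner.sh"], check=True)
    out = subprocess.run(
        ["python", "runners/eval_metric.py"],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def log_trial(note: str, score: float) -> None:
    """Append-only record, mirroring trial_log.md."""
    with open("trial_log.md", "a") as f:
        f.write(f"- {note}: score={score:.4f}\n")
```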

Repo: https://github.com/nicoRomeroCuruchet/DynamicProgramming

Happy to answer any questions! The most interesting finding was definitely how discrete-action DP environments create these limit-cycle attractors that act like local optima — and how reward shaping is the only way to truly “break” them.

submitted by /u/Grouchy_Ad_4112