[Update] Continuous RL via DP in CUDA: Solving the Underactuated Double Pendulum & Hybrid 6D Solvers
Quick follow-up on my project on Continuous RL via Dynamic Programming in CUDA. In my previous tests with the Overhead Crane and Double CartPole, the policy often got stuck in "partial" solutions (e.g. Link 1 upright + Link 2 free-spinning) or in periodic limit cycles. I just shipped a fix. This remains pure DP: no LQR, no continuous policy gradients. Highlights below.

1. Underactuated Double Pendulum (4D sandbox)

I added a new runner: two coupled links on a fixed pivot. Torque is applied only at the base joint; Link 2 moves purely via inertial coupling.
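For concreteness, here's a minimal sketch of the kind of dynamics this runner integrates, assuming point-mass links and angles measured from the downward vertical. The masses, lengths, and the Euler integrator are illustrative stand-ins, not the repo's actual values:

```python
import numpy as np

# Illustrative parameters -- not the repo's actual values.
M1, M2 = 1.0, 1.0   # link masses (kg)
L1, L2 = 1.0, 1.0   # link lengths (m)
G = 9.81

def step(state, u, dt=0.01):
    """One Euler step of the underactuated double pendulum.

    state = (th1, w1, th2, w2), angles from the downward vertical.
    Torque u acts on joint 1 only, so link 2 is driven purely
    through the off-diagonal (coupling) terms of the mass matrix.
    """
    th1, w1, th2, w2 = state
    d = th1 - th2
    # Mass matrix for point masses at the link tips.
    M = np.array([[(M1 + M2) * L1**2,        M2 * L1 * L2 * np.cos(d)],
                  [M2 * L1 * L2 * np.cos(d), M2 * L2**2              ]])
    # Coriolis/centrifugal and gravity torques.
    c1 = M2 * L1 * L2 * w2**2 * np.sin(d)
    c2 = -M2 * L1 * L2 * w1**2 * np.sin(d)
    g1 = (M1 + M2) * G * L1 * np.sin(th1)
    g2 = M2 * G * L2 * np.sin(th2)
    # tau = [u, 0]: the second joint is unactuated.
    a1, a2 = np.linalg.solve(M, np.array([u - c1 - g1, -c2 - g2]))
    return (th1 + dt * w1, w1 + dt * a1, th2 + dt * w2, w2 + dt * a2)
```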
2. What finally cracked the reward shaping

The key insight: DP with discrete actions creates real fixed-point limit cycles. You can't just brute-force them away with bigger penalties; you have to design rewards that make them strictly worse than the optimum. My current reward function uses five specific terms.
Failure modes addressed: the partial "Link 1 upright, Link 2 free-spinning" solutions and the periodic limit cycles described above.
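As a hedged sketch of what this kind of five-term shaping can look like (the terms and coefficients below are illustrative stand-ins, not the tuned values from the harness):

```python
import numpy as np

def shaped_reward(th1, w1, th2, w2, u):
    """Illustrative five-term shaped reward (angles measured from
    upright here, 0 = link pointing up). Terms and coefficients are
    hypothetical; the point is to make every limit cycle score
    strictly worse than parking at the upright fixed point.
    """
    upright = np.cos(th1) + np.cos(th2)  # 1. reward both links up
    still = -0.1 * (w1**2 + w2**2)       # 2. penalize residual spin
    effort = -0.01 * u**2                # 3. penalize torque
    # 4. coupling bonus: link 2 upright only pays when link 1 is too,
    #    which devalues the "link 1 up, link 2 free-spinning" solution.
    coupled = 0.5 * max(0.0, np.cos(th1)) * max(0.0, np.cos(th2))
    # 5. cycle-breaker: kinetic energy near the goal costs extra, so
    #    orbiting the upright state is strictly worse than stopping on it.
    cycle_pen = -0.5 * np.exp(-(th1**2 + th2**2)) * (w1**2 + w2**2)
    return upright + still + effort + coupled + cycle_pen
```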
3. Hybrid solver for the 6D Double CartPole

To solve the 6D variant (which is notoriously difficult), I implemented a two-stage controller logic within the DP framework, switching between a global policy and a near-upright policy. Hysteresis on the switch (enter at |θ| < 0.28, exit at |θ| > 0.35) prevents rapid toggling, and the result is a level of precision that's impossible to achieve with a single global policy.
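A minimal sketch of the hysteresis logic, using the thresholds from the post; the "global"/"fine" policy names and the choice of a single gating angle θ are my assumptions:

```python
class HybridSwitch:
    """Two-policy switch with hysteresis. Thresholds are from the post;
    the 'global'/'fine' names and the single gating angle theta are
    illustrative assumptions. The dead band between 0.28 and 0.35 rad
    is what prevents rapid toggling around a single threshold.
    """
    ENTER = 0.28  # switch to the fine (near-upright) policy below this
    EXIT = 0.35   # fall back to the global policy only above this

    def __init__(self):
        self.fine_mode = False

    def select(self, theta):
        if not self.fine_mode and abs(theta) < self.ENTER:
            self.fine_mode = True
        elif self.fine_mode and abs(theta) > self.EXIT:
            self.fine_mode = False
        return "fine" if self.fine_mode else "global"
```

Inside the dead band the controller simply keeps whichever policy it last committed to, which is exactly what kills the rapid toggling.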
4. Autoresearch harness (the meta-tool)

This shaping wasn't found by hand. I used an LLM agent to iterate over 30+ trials (edit coefficients → train → evaluate → score), inspired by Karpathy's autoresearch. The harness is now included in the repo.
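A sketch of what such a loop can look like; `llm`, `train_policy`, and `evaluate` are injected placeholder callables, not functions from the linked repo:

```python
import json

def autoresearch(llm, train_policy, evaluate, n_trials=30):
    """Hypothetical sketch of the coefficient-search loop: the LLM sees
    the (coefficients -> score) history and proposes the next edit.
    All three callables are placeholders, not the repo's actual API.
    """
    history = []
    for _ in range(n_trials):
        prompt = (
            "Past reward coefficients and their evaluation scores:\n"
            + json.dumps(history, indent=2)
            + "\nPropose the next coefficient dict, as JSON only."
        )
        coeffs = json.loads(llm(prompt))  # edit coefficients
        policy = train_policy(coeffs)     # train (e.g. a GPU DP sweep)
        score = evaluate(policy)          # evaluate & score rollouts
        history.append({"coeffs": coeffs, "score": score})
    return max(history, key=lambda t: t["score"])
```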
Sonnet 3.7/4.6 ran the loop overnight for about $1–2 in API tokens to find the optimal coefficients.

Repo: https://github.com/nicoRomeroCuruchet/DynamicProgramming

Happy to answer any questions! The most interesting finding was definitely how discrete-action DP environments create these limit-cycle attractors that act like local optima, and how reward shaping is the only way to truly "break" them.