Continuous RL via Dynamic Programming in CUDA (Solving Overhead Crane, Double CartPole, etc.)
I built a highly parallel CUDA implementation of Policy Iteration for continuous state/action spaces using barycentric interpolation. It solves complex systems like an Overhead Crane and Double CartPole without relying on standard deep RL methods.
I’ve been working on this based on the theoretical framework from “Continuous RL via Dynamic Programming” (Dupuis & Kushner). I’m sharing it here because I think this approach is heavily underrepresented compared to DQN/PPO and deserves more attention.
Most RL implementations discretize the problem and call it a day. This framework is more principled: it starts from the continuous Hamilton-Jacobi-Bellman PDE and derives a discrete scheme that provably converges to its solution as grid resolution increases.
The key ingredient is barycentric interpolation: after a forward Euler step, the next state lands between grid nodes. Instead of snapping to the nearest node, the value is recovered as a convex combination over the enclosing hypercube corners. This preserves second-order accuracy without explicit error correction.
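In d dimensions this is multilinear interpolation over the 2^d corners of the enclosing cell: each corner gets a weight that is a product of the fractional coordinates (or their complements), and the weights sum to 1, so the result is a convex combination. A minimal host-side C++ sketch (the repo's device version will differ in naming and memory layout):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Multilinear ("barycentric over the hypercube") interpolation on a uniform
// grid. `x` is the query point in grid coordinates (units of cells), `V` the
// node values in row-major order, `dims` the node count per axis (each >= 2).
double interp(const std::vector<double>& x,
              const std::vector<double>& V,
              const std::vector<std::size_t>& dims) {
    const std::size_t d = dims.size();
    std::vector<std::size_t> base(d);
    std::vector<double> frac(d);
    for (std::size_t k = 0; k < d; ++k) {
        double lo = std::floor(x[k]);
        lo = std::min(std::max(lo, 0.0), double(dims[k] - 2)); // clamp to grid
        base[k] = std::size_t(lo);
        frac[k] = std::min(std::max(x[k] - lo, 0.0), 1.0);
    }
    double out = 0.0;
    for (std::size_t corner = 0; corner < (std::size_t(1) << d); ++corner) {
        double w = 1.0;          // convex weight for this corner
        std::size_t idx = 0;     // flat row-major index of the corner node
        for (std::size_t k = 0; k < d; ++k) {
            const bool hi = (corner >> k) & 1u;
            w *= hi ? frac[k] : 1.0 - frac[k];
            idx = idx * dims[k] + base[k] + (hi ? 1 : 0);
        }
        out += w * V[idx];
    }
    return out;
}
```

Because the weights are a convex combination, the scheme reproduces any function that is linear over the cell exactly, which is where the accuracy claim comes from.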
The operator F^δ is a contraction mapping with modulus λ = γ^τ_min < 1, so by Banach’s theorem, convergence to the unique optimal value function is guaranteed regardless of initialization.
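Spelled out in the post's notation, contraction plus Banach's fixed-point theorem gives a geometric convergence rate for the sweeps:

```
‖F^δ V − F^δ W‖∞ ≤ λ·‖V − W‖∞,        λ = γ^(τ_min) < 1
‖V_k − V*‖∞ = ‖(F^δ)^k V₀ − (F^δ)^k V*‖∞ ≤ λ^k·‖V₀ − V*‖∞ → 0
```

Note that λ approaches 1 as the minimum time step shrinks, so finer time discretization means more sweeps to reach a given tolerance.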
Each environment injects its dynamics as a raw CUDA C device function compiled at runtime via NVRTC. The Bellman update is embarrassingly parallel: one GPU thread per grid node, with each thread reading only the previous value-function iterate and writing its own node, so no inter-thread synchronization is needed within a sweep.
```
// For each grid node ξᵢ (parallel, one CUDA thread)
for each action u ∈ U:
    η ← ξᵢ + τ · f(ξᵢ, u)              // forward Euler step
    V_next ← barycentric_interp(η, V)
    Q(u) ← r(ξᵢ, u) + γ^τ · V_next
V(ξᵢ) ← max_u Q(u)
π(ξᵢ) ← argmax_u Q(u)
```
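This isn't the repo's kernel, but the update above can be sanity-checked serially on a toy 1D problem (ẋ = u, r = −x², linear interpolation as the 1D case of the barycentric weights; grid size, τ, and γ here are illustrative values of my choosing):

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Serial reference of the Bellman sweep on a 1D toy problem.
// Returns the greedy action at the leftmost and rightmost grid nodes.
std::pair<double, double> solve_toy() {
    const int    N    = 101;                  // grid nodes on [-1, 1]
    const double h    = 2.0 / (N - 1);        // grid spacing
    const double tau  = 0.05;                 // forward Euler step
    const double disc = std::pow(0.9, tau);   // discount gamma^tau
    const double U[3] = {-1.0, 0.0, 1.0};     // action set

    std::vector<double> V(N, 0.0), Vn(N);
    std::vector<int> pi(N, 0);

    // 1D case of the barycentric weights: plain linear interpolation.
    auto interp = [&](double x) {
        double s = (x + 1.0) / h;                       // grid coordinate
        s = std::min(std::max(s, 0.0), double(N - 1));  // clamp to grid
        int i = std::min(int(s), N - 2);
        double f = s - i;
        return (1.0 - f) * V[i] + f * V[i + 1];
    };

    for (int sweep = 0; sweep < 3000; ++sweep) {
        for (int i = 0; i < N; ++i) {         // one CUDA thread per i on the GPU
            double x = -1.0 + i * h;
            double best = -1e300; int bestu = 0;
            for (int a = 0; a < 3; ++a) {
                double eta = x + tau * U[a];              // forward Euler step
                double q = -x * x + disc * interp(eta);   // Bellman backup
                if (q > best) { best = q; bestu = a; }
            }
            Vn[i] = best; pi[i] = bestu;      // synchronous (Jacobi) update
        }
        V = Vn;
    }
    return {U[pi[0]], U[pi[N - 1]]};
}
```

The converged policy pushes toward the origin from both ends of the grid, as expected for a −x² reward.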
Environments Solved
- CartPole (4D, 30⁴ grid)
- Pendulum swing-up (2D, 200² grid)
- Mountain Car discrete and continuous (2D, 200² grid)
- Double CartPole (6D, 12⁶ grid ≈ 3 million nodes; memory scales brutally with dimension)
- Overhead Crane anti-sway (4D, 30⁴ grid)
The Overhead Crane was the hardest to get right. The system is a trolley (1 kg) carrying a suspended load (5 kg) on a 1.5 m rope. The goal is to move the trolley from x = +2.5 m to x = −2.5 m while aggressively damping load swing.
The 5:1 mass ratio creates a nasty coupling: accelerating the trolley swings the load backward, which then physically pulls the trolley back. This is classic input-shaping territory.
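For reference, the textbook frictionless model of this system (θ measured from the vertical; the repo's f(ξ, u) may add damping or other terms) is:

```
(M + m)·ẍ + m·l·(θ̈·cosθ − θ̇²·sinθ) = F
l·θ̈ + ẍ·cosθ + g·sinθ = 0
```

With M = 1 kg, m = 5 kg, l = 1.5 m, the second equation near θ ≈ 0 gives θ̈ ≈ −(ẍ + g·θ)/l: any trolley acceleration immediately swings the load the opposite way, which is exactly the coupling described above.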
What finally made it work:
- Tight reward normalization: Using θ_norm = π/6 (30°) instead of π/2 means even a 15° swing gives a penalty of 0.25. The agent actually learns to care about small angles.
- Angular velocity term in the reward: Without penalizing θ̇, the policy lets the pendulum oscillate as long as the angle is occasionally near zero. Adding 0.30·θ̇² teaches it to actively damp the swing.
- Expanding the velocity grid: With ±30 N force on a 6 kg system, acceleration is ~5 m/s². The original ±2 m/s velocity grid was saturating in under 0.4 seconds. I expanded it to ±4 m/s.
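Putting the first two bullets together, the swing part of the reward plausibly looks like the sketch below. The θ_norm = π/6 normalization and the 0.30·θ̇² weight are quoted from above; the quadratic form and how this combines with position/velocity terms are my assumptions, and the exact reward lives in the repo.

```cpp
#include <cmath>

// Sketch of the crane swing penalty. theta_norm = pi/6 and the 0.30 weight
// are from the post; the quadratic form is an assumption.
double swing_penalty(double theta, double theta_dot) {
    const double pi = std::acos(-1.0);
    const double theta_norm = pi / 6.0;   // 30 degrees: tight normalization
    const double a = theta / theta_norm;  // a 15-degree swing gives a = 0.5
    return a * a + 0.30 * theta_dot * theta_dot;
}
```

With this shape a 15° swing already costs 0.25, and a pendulum oscillating through zero still pays through the θ̇² term, matching both bullets.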
The resulting policy executes proper input-shaping behavior entirely on its own—it emerges strictly from the reward structure and the dynamics.
Repo: https://github.com/nicoRomeroCuruchet/DynamicProgramming.git
I’d love to hear your thoughts on this approach, especially if anyone else has experimented with continuous dynamic programming as an alternative to neural-network approximation. Happy to answer any questions about the CUDA implementation!
submitted by /u/Grouchy_Ad_4112