The Reward Scaling Problem in Reinforcement Learning for Quadruped Robots: Unstable Bipedal Behavior, Jitter, and Command Leakage

Hi all,

I’m training a quadruped robot (Isaac Gym / legged_gym style) and want a single policy that switches between two modes based on a command input:

– command = 0 → stable quadruped standing

– command = 1 → stable bipedal standing (hind legs only)

However, I’m facing several issues that seem related to reward scaling and interference between reward terms.

Current reward components:

– zero linear/angular velocity tracking

– projected gravity alignment

– quadruped base height reward

– bipedal base height reward

– jerk penalty

– acceleration penalty

– action rate penalty

– front feet air-time reward (for bipedal)

– hind feet contact reward

– alive reward

– collision penalty
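For reference, I keep the relative weights in a legged_gym-style `cfg.rewards.scales` config. The names and values below are a placeholder sketch of how the terms are structured, not my actual numbers:

```python
# Hypothetical legged_gym-style reward scales (placeholder names and
# values, not my real config): positive = reward, negative = penalty.
class rewards:
    class scales:
        tracking_lin_vel = 1.0      # zero linear-velocity tracking
        tracking_ang_vel = 0.5      # zero angular-velocity tracking
        orientation = -1.0          # projected-gravity alignment
        base_height_quad = 2.0      # quadruped base-height target
        base_height_biped = 2.0     # bipedal base-height target
        dof_acc = -2.5e-7           # acceleration penalty
        action_rate = -0.01         # action-rate penalty
        feet_air_time_front = 1.0   # front-feet air time (bipedal)
        feet_contact_hind = 0.5     # hind-feet contact
        alive = 0.1                 # survival bonus
        collision = -1.0            # collision penalty
```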

Problems observed:

  1. Command leakage:

    – Under bipedal command (1), the robot still walks around instead of stabilizing

    – Motion seems weakly correlated with command input

  2. High-frequency jitter:

    – After standing up, joints exhibit rapid small oscillations

    – Especially severe in bipedal stance

  3. Mode confusion:

    – Under quadruped command (0), the robot sometimes adopts partial bipedal poses

    – e.g., lifting two legs or asymmetric stance
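On the jitter point (problem 2), the smoothing penalties I mean are the standard legged_gym-style ones. A minimal numpy sketch, with placeholder shapes (function names here are illustrative, modeled on legged_gym's `_reward_action_rate` / `_reward_dof_acc`):

```python
import numpy as np

def action_rate_penalty(actions, last_actions):
    # Penalize the frame-to-frame change in actions; larger values
    # correspond to more high-frequency jitter in the policy output.
    return float(np.sum(np.square(actions - last_actions)))

def dof_acc_penalty(dof_vel, last_dof_vel, dt):
    # Penalize joint accelerations, approximated by a finite difference
    # of joint velocities over one control step of length dt.
    return float(np.sum(np.square((dof_vel - last_dof_vel) / dt)))
```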

Questions:

  1. How do you typically balance competing reward terms in multi-modal behaviors like this?

  2. Are there known tricks to enforce stronger “mode separation” between commands?

  3. What are common causes of high-frequency jitter in RL locomotion policies? Is it usually due to insufficient action smoothing penalties or conflicting rewards?
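To make question 2 concrete: by “mode separation” I mean hard-gating the mode-specific rewards on the command, so the two height targets never compete. A minimal sketch of what I have in mind (the target heights and sigma are placeholder values, not tuned numbers):

```python
import numpy as np

def base_height_reward(base_height, command, h_quad=0.30, h_biped=0.55,
                       sigma=0.05):
    # Only the height target matching the current command contributes,
    # so the quadruped and bipedal height rewards cannot fight each other.
    target = h_biped if command == 1 else h_quad
    return float(np.exp(-((base_height - target) ** 2) / sigma ** 2))
```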

Any insights or references would be greatly appreciated!

submitted by /u/Obvious-Mixture-6607
