The Reward Scaling Problem in Reinforcement Learning for Quadruped Robots: Unstable Bipedal Behavior, Jitter, and Command Leakage

Hi all,

I’m training a quadruped robot (Isaac Gym / legged_gym style) and want a single policy that switches between two modes based on a command input:

– command = 0 → stable quadruped standing

– command = 1 → stable bipedal standing (hind legs only)

However, I’m facing several issues that seem related to reward scaling and interference between reward terms.

Current reward components:

– zero linear/angular velocity tracking

– projected gravity alignment

– quadruped base height reward

– bipedal base height reward

– jerk penalty

– acceleration penalty

– action rate penalty

– front feet air-time reward (for bipedal)

– hind feet contact reward

– alive reward

– collision penalty
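For reference, I keep the relative weights in a legged_gym-style `cfg.rewards.scales` config. The names and values below are a placeholder sketch of how the terms are structured, not my actual numbers:

```python
# Hypothetical legged_gym-style reward scales (placeholder names and
# values, not my real config): positive = reward, negative = penalty.
class rewards:
    class scales:
        tracking_lin_vel = 1.0      # zero linear-velocity tracking
        tracking_ang_vel = 0.5      # zero angular-velocity tracking
        orientation = -1.0          # projected-gravity alignment
        base_height_quad = 2.0      # quadruped base-height target
        base_height_biped = 2.0     # bipedal base-height target
        dof_acc = -2.5e-7           # acceleration penalty
        action_rate = -0.01         # action-rate penalty
        feet_air_time_front = 1.0   # front-feet air time (bipedal)
        feet_contact_hind = 0.5     # hind-feet contact
        alive = 0.1                 # survival bonus
        collision = -1.0            # collision penalty
```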

Problems observed:

  1. Command leakage:

    – Under bipedal command (1), the robot still walks around instead of stabilizing

    – Motion seems weakly correlated with command input

  2. High-frequency jitter:

    – After standing up, joints exhibit rapid small oscillations

    – Especially severe in bipedal stance

  3. Mode confusion:

    – Under quadruped command (0), the robot sometimes adopts partial bipedal poses

    – e.g., lifting two legs or asymmetric stance
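On the jitter point (problem 2), the smoothing penalties I mean are the standard legged_gym-style ones. A minimal numpy sketch, with placeholder shapes (function names here are illustrative, modeled on legged_gym's `_reward_action_rate` / `_reward_dof_acc`):

```python
import numpy as np

def action_rate_penalty(actions, last_actions):
    # Penalize the frame-to-frame change in actions; larger values
    # correspond to more high-frequency jitter in the policy output.
    return float(np.sum(np.square(actions - last_actions)))

def dof_acc_penalty(dof_vel, last_dof_vel, dt):
    # Penalize joint accelerations, approximated by a finite difference
    # of joint velocities over one control step of length dt.
    return float(np.sum(np.square((dof_vel - last_dof_vel) / dt)))
```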

Questions:

  1. How do you typically balance competing reward terms in multi-modal behaviors like this?

  2. Are there known tricks to enforce stronger “mode separation” between commands?

  3. What are common causes of high-frequency jitter in RL locomotion policies? Is it usually due to insufficient action smoothing penalties or conflicting rewards?
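To make question 2 concrete: by “mode separation” I mean hard-gating the mode-specific rewards on the command, so the two height targets never compete. A minimal sketch of what I have in mind (the target heights and sigma are placeholder values, not tuned numbers):

```python
import numpy as np

def base_height_reward(base_height, command, h_quad=0.30, h_biped=0.55,
                       sigma=0.05):
    # Only the height target matching the current command contributes,
    # so the quadruped and bipedal height rewards cannot fight each other.
    target = h_biped if command == 1 else h_quad
    return float(np.exp(-((base_height - target) ** 2) / sigma ** 2))
```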

Any insights or references would be greatly appreciated!

submitted by /u/Obvious-Mixture-6607
