The Reward Scaling Problem in Reinforcement Learning for Quadruped Robots: Unstable Bipedal Behavior, Jitter, and Command Leakage
Hi all,
I’m training a quadruped robot (Isaac Gym / legged_gym style) and trying to learn a single policy that switches between two commanded modes:
– command = 0 → stable quadruped standing
– command = 1 → stable bipedal standing (hind legs only)
However, I’m facing several issues that seem related to reward scaling and interference between reward terms.
Current reward components:
– zero linear/angular velocity tracking
– projected gravity alignment
– quadruped base height reward
– bipedal base height reward
– jerk penalty
– acceleration penalty
– action rate penalty
– front feet air-time reward (for bipedal)
– hind feet contact reward
– alive reward
– collision penalty
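For context, my reward scales are configured roughly like this. This is a simplified, hypothetical sketch of a legged_gym-style config; the class name, term names, and all numeric values here are illustrative, not my actual tuned settings:

```python
# Hypothetical legged_gym-style reward scales (illustrative values only).
# Positive = reward, negative = penalty; each raw term is multiplied by its
# scale and summed into the total per-step reward.
class RewardScales:
    tracking_lin_vel = 1.0      # reward zero linear velocity under both commands
    tracking_ang_vel = 0.5      # reward zero angular velocity
    orientation = 1.0           # projected-gravity alignment
    base_height_quad = 0.5      # quadruped target height (command = 0)
    base_height_biped = 0.5     # bipedal target height (command = 1)
    dof_acc = -2.5e-7           # penalize joint accelerations
    jerk = -2.5e-7              # penalize change of acceleration
    action_rate = -0.01         # penalize action differences between steps
    feet_air_time_front = 0.5   # front feet off the ground (bipedal mode)
    feet_contact_hind = 0.5     # hind feet in contact
    alive = 0.1                 # survival bonus per step
    collision = -1.0            # penalize undesired body contacts

def total_reward(terms: dict) -> float:
    """Sum raw reward terms weighted by their scales."""
    return sum(getattr(RewardScales, name) * value for name, value in terms.items())
```

My worry is that with this flat weighted sum, the penalties and the mode-specific rewards can trade off against each other in ways I don’t intend.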
Problems observed:
1. Command leakage:
– Under the bipedal command (1), the robot still walks around instead of stabilizing
– The motion seems only weakly correlated with the command input
2. High-frequency jitter:
– After standing up, the joints exhibit rapid small oscillations
– Especially severe in the bipedal stance
3. Mode confusion:
– Under the quadruped command (0), the robot sometimes adopts partial bipedal poses
– e.g., lifting two legs or an asymmetric stance
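To be concrete about the two base height terms, what I mean is a command-gated height reward, roughly like this (a simplified scalar sketch; the target heights and the kernel width are illustrative, not my real values):

```python
import math

# Illustrative targets, not my actual tuned values.
QUAD_HEIGHT = 0.35   # base height target on all fours [m]
BIPED_HEIGHT = 0.55  # base height target on hind legs [m]

def height_reward(base_height: float, command: int) -> float:
    """Command-gated base-height reward: command == 0 selects the quadruped
    target, command == 1 the bipedal target. Exponential kernel on the
    squared height error (width 0.01 is illustrative)."""
    target = BIPED_HEIGHT if command == 1 else QUAD_HEIGHT
    return math.exp(-(base_height - target) ** 2 / 0.01)
```

Even with this gating, the other (ungated) terms still apply in both modes, which may be where the interference comes from.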
Questions:
1. How do you typically balance competing reward terms in multi-modal behaviors like this?
2. Are there known tricks to enforce stronger “mode separation” between commands?
3. What are common causes of high-frequency jitter in RL locomotion policies? Is it usually due to insufficient action-smoothing penalties, or conflicting rewards?
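On question 3, the smoothing terms I listed are essentially finite differences of recent actions. In case it helps diagnose the jitter, this is roughly what I compute (a simplified sketch; a true jerk term would use one more step of history, and in practice these are computed on joint states in torch, not plain lists):

```python
def smoothing_penalties(a_t, a_prev, a_prev2):
    """Finite-difference smoothing penalty magnitudes over the action history.

    action rate:  ||a_t - a_{t-1}||^2          (first difference)
    acceleration: ||a_t - 2*a_{t-1} + a_{t-2}||^2  (second difference)
    """
    rate = sum((x - y) ** 2 for x, y in zip(a_t, a_prev))
    accel = sum((x - 2 * y + z) ** 2 for x, y, z in zip(a_t, a_prev, a_prev2))
    return rate, accel
```

A constant action gives zero for both terms, so I’d expect these penalties to suppress jitter once their scales are large enough, but raising them too far seems to fight the mode-switching rewards.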
Any insights or references would be greatly appreciated!
submitted by /u/Obvious-Mixture-6607