Smoothed action sampling for Gymnasium-style environments
Various RL training algorithms either take an occasional "explore" random action or collect initial random episodes to bootstrap training.
However, a general issue with random sampling – especially for small-timestep physics simulations – is that independently drawn actions average out to the midpoint of the action space.
This makes the agent's "random" trajectory wiggle tightly around the one produced by constantly applying that average action. E.g. in CarRacing it just incoherently slams steering, throttle and brakes, resulting in a short, low-reward trajectory, and in MountainCar random actions don't move the cart very far before the episode ends.
I just tested it in MountainCar (both the continuous and discrete action versions), and the "blind" smoothed random actions outperform the environment's random sample at providing useful (state, action, reward) trajectories to bootstrap training.
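The post doesn't include code, but the idea can be sketched as follows. Instead of drawing each action independently, keep a running action and nudge it toward a fresh random target each step, so consecutive actions stay correlated rather than averaging to the midpoint. The class name, the `alpha` parameter, and the exponential-moving-average formulation below are my assumptions, not the author's implementation:

```python
import numpy as np

class SmoothedSampler:
    """Low-pass filtered random actions for a Box-like action space.

    Hypothetical sketch: each call to sample() moves the current action
    a fraction `alpha` of the way toward a fresh uniform random target,
    producing a slowly drifting action instead of independent jitter.
    """

    def __init__(self, low, high, alpha=0.1, rng=None):
        self.low = np.asarray(low, dtype=np.float64)
        self.high = np.asarray(high, dtype=np.float64)
        self.alpha = alpha  # smoothing factor: smaller = smoother drift
        self.rng = rng if rng is not None else np.random.default_rng()
        # start from a uniform random point in the action space
        self.action = self.rng.uniform(self.low, self.high)

    def sample(self):
        target = self.rng.uniform(self.low, self.high)
        # exponential moving average toward the random target;
        # a convex combination of in-bounds points stays in bounds
        self.action = (1 - self.alpha) * self.action + self.alpha * target
        return self.action.copy()
```

In a Gymnasium loop this would replace `env.action_space.sample()` for continuous spaces; for discrete spaces the analogous trick is simply holding each randomly chosen action for several consecutive steps.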
Have fun!
submitted by /u/blimpyway