Help with PPO (reward not increasing)
I’m working on an optimization problem with a complex environment. The environment’s inner workings are complex, but it has only one action input, which can be binary, discrete, or continuous. When optimized with a binary action, the maximum achievable reward is lower than with discrete or continuous actions. PPO works when the action is binary or discrete, but not when it’s continuous. The action passed to the environment needs to be a value between 0 and some maximum value x, so I designed the model to predict a mean between -1 and 1, with the standard deviation as a state-independent parameter starting at 1. If the sample is negative, the action is set to 0; otherwise the action is obtained by scaling the sample by x and clamping it between 0 and x.
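Roughly, the setup looks like this (a minimal PyTorch sketch of what I described above, not my exact code; names like `ContinuousPolicy`, `obs_dim`, and `x_max` are placeholders, and the tanh squash is just one way to bound the mean):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class ContinuousPolicy(nn.Module):
    # Action head as described: mean bounded to [-1, 1], std is a
    # state-independent parameter initialised so it starts at 1.
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, 1)
        self.log_std = nn.Parameter(torch.zeros(1))  # exp(0) = 1 at init

    def forward(self, obs: torch.Tensor) -> Normal:
        h = self.body(obs)
        mean = torch.tanh(self.mean_head(h))  # mean in [-1, 1]
        return Normal(mean, self.log_std.exp())


def sample_action(dist: Normal, x_max: float):
    # Raw Gaussian sample -> environment action: negative samples become 0,
    # positive samples are scaled by x_max and clamped to [0, x_max].
    raw = dist.sample()
    env_action = torch.clamp(raw * x_max, min=0.0, max=x_max)
    # The log-prob fed into the PPO ratio is of the raw sample; the clamping
    # above is not reflected in it.
    return raw, env_action, dist.log_prob(raw)
```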
It turns out that with this setup my model is not able to learn. If I use an entropy loss, the policy’s entropy increases without bound; if I don’t use the entropy loss, it collapses to near zero. Does anyone have an idea what I might be doing wrong or how to make it work? Note that the environment has at most 25 timesteps, with the reward guaranteed to arrive at the last timestep. I’ve tried running for 2 million timesteps.
submitted by /u/reddo-lumen