PPO agent for network control

I built a PPO agent to control flows inside a physical network. The agent sets 15 control variables, which in the physical world correspond to how strongly we pump the medium through the network. It works after about 25 million environment steps. I have been testing different reward functions, and so far the best has been something like the following:

reward = -1.0 * tanh(physical_violations_in_network) + 0.05 * tanh(violation_improvement_from_previous_step) - 0.07 * tanh(violation_deterioration_from_previous_step)

I made the improvement coefficient and the deterioration coefficient different in order to reduce oscillation. It helps somewhat, but not perfectly. I also tried removing the improvement/deterioration terms, but then the agent performs worse. Could someone give me feedback, or tell me if I am doing something stupid?
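For reference, here is a minimal Python sketch of the shaped reward above. The coefficients come from the post; the function name `shaped_reward` and the definitions of improvement/deterioration as the positive/negative part of the step-to-step change in violations are my assumptions, since the post does not define them.

```python
import math

def shaped_reward(violations: float, prev_violations: float) -> float:
    """Hypothetical sketch of the reward shaping from the post.

    Penalizes the current violation level, gives a small bonus for
    step-to-step improvement, and a slightly larger penalty for
    deterioration (the asymmetry is meant to damp oscillation).
    Assumption: improvement/deterioration are the positive/negative
    parts of the change in violations since the previous step.
    """
    improvement = max(prev_violations - violations, 0.0)
    deterioration = max(violations - prev_violations, 0.0)
    return (-1.0 * math.tanh(violations)
            + 0.05 * math.tanh(improvement)
            - 0.07 * math.tanh(deterioration))
```

With this formulation, the agent still receives the full violation penalty every step, so the shaping terms only nudge the direction of change rather than dominate the signal.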

submitted by /u/Icedkk