How do I figure out what made my reward plummet?

This is a plot of rewards vs. steps in a cooperative MARL game with interdependent agents and a global objective. I thought training was heading in the right direction, but it seems to have completely derailed.

I’m using the CTDE (centralized training, decentralized execution) paradigm in combination with Dreamer, with an adaptive, entropy-dependent episode length. I decided to do this because the agents seemed to get stuck in deadlocks. The idea was to first sample from very successful short episodes and then increase the episode length as the proportion of good samples in the replay buffer grows. This seems to work in the beginning (0-250k steps), but the reward degrades steadily afterwards.

Has something like this happened to any of you before? I’m not sure what to do, especially since the learning curve was so steep in the beginning.

Edit:

Hypothesis #1: The actors are trained during the dreaming process. During imagination, the predicted reward signals might be higher than the real ones.

Counterargument #1: The world model is learned as well, so if the imagined rewards do not match the real ones, the model should be corrected and in turn change the behavior of the actors. The imagination and world model losses are both dropping, though.
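
One way to check Hypothesis #1 directly would be to compare the world model's predicted rewards with the real rewards on the same real transitions, and to look at how the gap changes over training. Below is a minimal sketch of that check. It assumes you can log both values per step. The function name `reward_bias_report` and the synthetic data are hypothetical stand-ins, not part of Dreamer or of my setup.

```python
import numpy as np


def reward_bias_report(predicted, real, n_chunks=5):
    """Compare predicted rewards with real rewards on the same transitions.

    predicted, real: 1-D sequences of equal length, ordered by training step.
    Returns the overall bias (mean of predicted - real), the mean absolute
    error, and the bias in each of n_chunks consecutive slices of the data.
    """
    predicted = np.asarray(predicted, dtype=float)
    real = np.asarray(real, dtype=float)
    if predicted.ndim != 1 or predicted.shape != real.shape:
        raise ValueError("predicted and real must be 1-D arrays of equal length")

    diff = predicted - real
    # Consecutive slices make a drift over training visible.
    chunks = np.array_split(diff, n_chunks)
    return {
        "overall_bias": float(diff.mean()),
        "mae": float(np.abs(diff).mean()),
        "bias_per_chunk": [float(c.mean()) for c in chunks],
    }


# Synthetic stand-in data (assumption): predictions that become
# increasingly optimistic as training goes on.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=1000)
predicted = real + np.linspace(0.0, 0.5, 1000)

report = reward_bias_report(predicted, real)
print(report)
```

A positive and growing `bias_per_chunk` would support the hypothesis that imagination is over-optimistic. A bias near zero would point away from the reward head and towards something else, for example the dynamics model or the episode-length schedule.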

submitted by /u/Markovvy