DQN reward stagnation

I’m working on a project where a DQN tries to optimize some experiments that I’ve basically gamified to reward exploration/diversity of trajectories. I understand the fundamentals of DQN, but I haven’t worked extensively with it before this project, so I don’t have much intuition built up yet. I’ve seen varying advice on training parameters: I’m training for 200k steps (the agent takes 4 actions per step), but I’m not sure how to choose my replay buffer size, batch size, or target network update frequency.

I’ve had weird runs where the loss converges quickly but the reward doesn’t change at all, and others where the loss mostly converges but the reward actually decreases over training. For the target update interval in particular, I’ve seen recommendations ranging from 10 steps to 3000 steps, so I’m pretty confused about that. Any recommendations or materials I should read?
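For context, here’s roughly the shape of the update loop I’m running (a simplified sketch with made-up placeholder values, network sizes, and dimensions, not my actual code), just so it’s clear where the buffer size, batch size, and target update frequency come into play:

    import random
    from collections import deque

    import torch
    import torch.nn as nn

    # Placeholder hyperparameters -- these are the knobs I'm asking about;
    # the numbers here are illustrative, not what I'm claiming is correct.
    TOTAL_STEPS = 200_000        # training steps (each step = 4 agent actions in my setup)
    BUFFER_SIZE = 100_000        # replay buffer capacity
    BATCH_SIZE = 64              # minibatch size per gradient update
    TARGET_UPDATE_EVERY = 1_000  # hard-copy online weights into the target net every N steps
    GAMMA = 0.99

    OBS_DIM, N_ACTIONS = 16, 4   # made-up dimensions just so the sketch runs

    def make_net() -> nn.Module:
        return nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, N_ACTIONS))

    online_net = make_net()
    target_net = make_net()
    target_net.load_state_dict(online_net.state_dict())
    optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)
    replay = deque(maxlen=BUFFER_SIZE)  # oldest transitions fall off once capacity is hit

    def dqn_update(step: int) -> None:
        """One gradient step on a random minibatch, plus the periodic target sync."""
        if len(replay) < BATCH_SIZE:
            return
        # Each stored transition is assumed to be a tuple of tensors:
        # (obs, action, reward, next_obs, done)
        batch = random.sample(replay, BATCH_SIZE)
        obs, act, rew, next_obs, done = map(torch.stack, zip(*batch))

        q = online_net(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            next_q = target_net(next_obs).max(dim=1).values
            target = rew + GAMMA * (1.0 - done) * next_q

        loss = nn.functional.smooth_l1_loss(q, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # This is the interval I keep seeing wildly different advice on (10 vs 3000 steps).
        if step % TARGET_UPDATE_EVERY == 0:
            target_net.load_state_dict(online_net.state_dict())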

submitted by /u/owj2082