Simplifying Deep Temporal Difference Learning
tl;dr: The authors propose PQN, a simplified deep online Q-learning algorithm that uses very small replay buffers. Normalization and parallelized sampling from vectorized environments stabilize training without the need for a huge replay buffer. PQN is competitive with more complex methods such as Rainbow, PPO-RNN, and QMix while being 50x faster than traditional DQN.

Introduction

Temporal difference (TD) methods can be simple and efficient, but are notoriously unstable when combined with neural networks or off-policy sampling. All of the following […]
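To make the tl;dr concrete, here is a minimal sketch of the recipe it describes: online Q-learning across vectorized environments, with layer normalization inside the Q-network and no separate target network or large replay buffer. This is an illustrative toy using gymnasium and PyTorch, not the authors' PQN implementation; the network layout, hyperparameters, and names like `QNet` are assumptions.

```python
# Illustrative sketch only: LayerNorm Q-network trained online from parallel envs.
import gymnasium as gym
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1")] * 8)
n_actions = int(envs.single_action_space.n)
q = QNet(envs.single_observation_space.shape[0], n_actions)
opt = torch.optim.Adam(q.parameters(), lr=3e-4)
gamma, eps = 0.99, 0.1

obs, _ = envs.reset(seed=0)
obs = torch.as_tensor(obs, dtype=torch.float32)
for step in range(1000):
    # Epsilon-greedy action for every parallel environment.
    with torch.no_grad():
        greedy = q(obs).argmax(dim=-1)
    random_a = torch.randint(n_actions, greedy.shape)
    actions = torch.where(torch.rand(greedy.shape) < eps, random_a, greedy)

    next_obs, rewards, terms, truncs, _ = envs.step(actions.numpy())
    next_obs = torch.as_tensor(next_obs, dtype=torch.float32)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    done = torch.as_tensor(terms, dtype=torch.float32)

    # One-step TD target from the same online network (no target network,
    # no replay buffer); episode autoreset details are glossed over here.
    with torch.no_grad():
        target = rewards + gamma * (1 - done) * q(next_obs).max(dim=-1).values
    q_sa = q(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)

    opt.zero_grad()
    loss.backward()
    opt.step()
    obs = next_obs
```

The point of the sketch is the structure of the update, not the specifics: each gradient step consumes a fresh batch of transitions gathered in parallel, and the LayerNorm layers play the stabilizing role that a target network and large replay buffer play in classic DQN.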