Have I discovered a SOTA probabilistic value head loss?
…or have I made some kind of critical mistake somewhere? A while ago, I made a post here discussing techniques for optimizing a value head that predicts both the mean and the variance of values from a given state. I was having some trouble, and had looked at a few papers, but found no solutions that performed adequately even on a quite simple toy environment consisting of three ‘doors’ leading to next-states with distinct reward distributions. The first […]