A Diffusion Analysis of Policy Gradient for Stochastic Bandits
arXiv:2603.10219v1 Announce Type: new
Abstract: We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
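
To make the roles of $\eta$, $\Delta$ and $n$ concrete, the following is a minimal illustrative sketch of discrete-time softmax policy gradient on a bandit. It is not the continuous-time diffusion analysed in the paper. The softmax parameterisation, the REINFORCE-style update without a baseline, the unit-variance Gaussian rewards, and the names `softmax` and `run_policy_gradient` are all assumptions made for illustration. The constant 1 in the learning-rate choice is likewise an assumption, since the abstract only gives the order $O(\Delta^2/\log(n))$.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def run_policy_gradient(means, eta, n, seed=0):
    """Simulate softmax policy gradient; return (pseudo-regret, final policy).

    Assumptions (not from the abstract): logits start at zero, rewards are
    Gaussian with unit variance, and no baseline is subtracted.
    """
    rng = random.Random(seed)
    k = len(means)
    theta = [0.0] * k  # uniform initial policy
    best = max(means)
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        arm = rng.choices(range(k), weights=pi)[0]
        reward = rng.gauss(means[arm], 1.0)
        # Gradient of log pi(arm) w.r.t. theta_a is 1{a == arm} - pi_a.
        for a in range(k):
            indicator = 1.0 if a == arm else 0.0
            theta[a] += eta * reward * (indicator - pi[a])
        # Pseudo-regret: gap between the best mean and the played arm's mean.
        regret += best - means[arm]
    return regret, softmax(theta)


if __name__ == "__main__":
    means = [0.5, 0.0, 0.0]       # hypothetical instance, minimum gap 0.5
    n = 10_000
    delta = 0.5
    eta = delta ** 2 / math.log(n)  # order suggested by the abstract, constant assumed
    regret, policy = run_policy_gradient(means, eta, n)
    print(f"eta={eta:.4f} regret={regret:.1f} policy={policy}")
```

Rerunning the script with a much larger `eta` on the same instance is one informal way to explore the sensitivity to the learning rate that the abstract describes, although a single simulated run proves nothing about the regret bounds.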