Variational Proximal Policy Optimization
arXiv:2606.08032v1 Announce Type: new
Abstract: Reinforcement Learning from Human Feedback via Proximal Policy Optimization often suffers from policy mode collapse, brittle exploration loops, and distribution drift. This paper introduces Variational Proximal Policy Optimization (\(\textsc{VP}_2\textsc{O}\)), a particle-based variational inference framework that maps policy optimization to Stein Variational Gradient Descent within a Mixture-of-Experts architecture. By leveraging functional kernels over localized expert prototypes alongside an expert orthogonalization loss, \(\textsc{VP}_2\textsc{O}\) introduces a geometry-based proximal-control mechanism that can reduce reliance on fixed clipping or KL schedules. Our results on a 33B/4B sparse Mixture-of-Experts model show improvements across complex reasoning benchmarks, including a \(+\mathbf{179}\) Elo gain on Codeforces and a \(\mathbf{32\%}\) reduction in token count on AIME mathematical reasoning tasks.
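
For readers unfamiliar with Stein Variational Gradient Descent (SVGD), the particle update that the abstract builds on can be sketched as follows. This is a minimal illustration of the textbook SVGD step with a fixed-bandwidth RBF kernel on plain vectors, not the paper's method: the functional kernels over expert prototypes, the orthogonalization loss, and the Mixture-of-Experts policy are not reproduced here. The names `rbf_kernel`, `svgd_step`, `score_fn`, `h`, and `step_size` are assumptions chosen for this sketch.

```python
import numpy as np


def rbf_kernel(x, h=1.0):
    """RBF kernel matrix and its gradient for particles x of shape (n, d).

    k[j, i] = exp(-||x_j - x_i||^2 / h)
    grad_k[j, i] = gradient of k(x_j, x_i) with respect to x_j, shape (n, n, d)
    """
    diff = x[:, None, :] - x[None, :, :]          # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)
    k = np.exp(-sq_dist / h)
    grad_k = (-2.0 / h) * diff * k[:, :, None]
    return k, grad_k


def svgd_step(x, score_fn, step_size=0.1, h=1.0):
    """One SVGD update.

    score_fn(x) must return grad log p(x) for each particle, shape (n, d).
    The first term pulls particles toward high-density regions; the second
    (kernel gradient) term pushes particles apart, which counteracts collapse
    onto a single mode.
    """
    n = x.shape[0]
    k, grad_k = rbf_kernel(x, h)
    score = score_fn(x)
    # phi[i] = (1/n) * sum_j ( k[j, i] * score[j] + grad_k[j, i] )
    phi = (k.T @ score + grad_k.sum(axis=0)) / n
    return x + step_size * phi


if __name__ == "__main__":
    # Illustrative target (an assumption of this sketch): a standard normal,
    # whose score function is simply -x.
    rng = np.random.default_rng(0)
    particles = rng.normal(loc=5.0, scale=1.0, size=(50, 1))
    for _ in range(500):
        particles = svgd_step(particles, lambda x: -x)
    print("particle mean:", particles.mean())
```

The repulsive kernel-gradient term is what distinguishes this update from running independent gradient ascent on each particle, and it is the part most closely related to the mode-collapse problem the abstract mentions.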