CAPO: A Coherence-Adaptive Advantage Mechanism Instantiated in Proximal Policy Optimization

Performance degradation caused by fluctuations in the temporal-difference signal and by the inflexibility of fixed proximal constraints remains a key challenge in policy optimization algorithms. To address these issues, this article develops a Coherence-Adaptive Proximal Optimization (CAPO) method. We first derive a temporal-coherence-aware advantage estimation mechanism that measures the directional consistency of temporal-difference residuals within a local time window. Based on this mechanism, short-horizon and long-horizon advantage estimates are adaptively combined to provide a more stable and temporally reliable policy improvement signal. CAPO is then instantiated within the PPO framework by incorporating the temporal coherence mechanism into both advantage construction and proximal policy updates. The proposed method preserves the stable optimization structure of PPO while adjusting the clipping range according to sample-wise temporal coherence. In this way, CAPO permits larger policy updates when the learning signal is temporally consistent and imposes more conservative updates when the signal is unstable. Experiments on several representative OpenAI Gym control tasks show that CAPO achieves performance better than or comparable to standard PPO in most benchmark environments, with improved training stability and convergence behavior, especially in tasks where local temporal feedback provides reliable policy improvement information.
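As a rough illustration of the mechanism described above, the following Python sketch shows one way the pieces could fit together. The abstract does not state the exact formulas, so everything specific here is an assumption made for illustration rather than the authors' definition: the coherence score (absolute mean sign of the TD residuals over a forward window), the linear blend of short- and long-horizon GAE-style estimates, the linear clip-range schedule, and all function names and default parameters.

```python
import numpy as np

def td_residuals(rewards, values, gamma=0.99):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` has length T + 1.
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def gae(deltas, gamma=0.99, lam=0.95):
    # Standard generalized advantage estimation via a backward recursion.
    deltas = np.asarray(deltas, dtype=float)
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def coherence(deltas, window=5):
    # ASSUMPTION: directional consistency = |mean sign| of the residuals in
    # the window starting at t. 1 means all signs agree, 0 means balanced.
    signs = np.sign(np.asarray(deltas, dtype=float))
    return np.array([abs(signs[t:t + window].mean()) for t in range(len(signs))])

def blended_advantage(deltas, c, gamma=0.99, lam_short=0.5, lam_long=0.95):
    # ASSUMPTION: weight the long-horizon estimate by the coherence score and
    # the short-horizon estimate by its complement.
    return c * gae(deltas, gamma, lam_long) + (1.0 - c) * gae(deltas, gamma, lam_short)

def adaptive_clip(c, eps_min=0.1, eps_max=0.3):
    # ASSUMPTION: the clip range grows linearly with coherence, so coherent
    # samples get a wider range and incoherent samples a narrower one.
    return eps_min + (eps_max - eps_min) * np.asarray(c, dtype=float)

def clipped_surrogate(ratio, adv, eps):
    # PPO clipped objective, evaluated per sample with a per-sample clip range.
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv)
```

In this sketch, a rollout would be processed by computing `td_residuals`, then `coherence`, then passing the coherence scores to both `blended_advantage` and `adaptive_clip` before evaluating `clipped_surrogate` on the probability ratios. Only the per-sample clip range differs from the standard PPO objective, which is consistent with the claim that the PPO optimization structure is preserved.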
