arXiv Endorsement Request – cs.LG/cs.AI – Identified two optimization pathologies in Multi-Timescale PPO
Hey guys,
I am an undergrad researcher finalizing a preprint on multi-timescale temporal credit assignment, and I am looking for an arXiv endorsement for cs.LG (or cs.AI).
Title: Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
TL;DR:
We investigated why dynamically routing multi-timescale advantages inside an Actor-Critic architecture often leads to policy collapse. We formally diagnosed two pathologies:
1. Surrogate Objective Hacking: Differentiable routing lets the PPO policy gradient hijack the attention weights, artificially minimizing the clipped surrogate loss while ignoring physical control.
2. Paradox of Temporal Uncertainty: Gradient-free routing via inverse-variance weighting forces irreversible myopic degeneration, because the softmax disproportionately locks onto short-term horizons, which have naturally lower aleatoric uncertainty.
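A toy NumPy sketch of both pathologies, for intuition only. It is not taken from the MRE scripts: the function names, the three-horizon setup, and all numbers are made up for illustration, and it assumes the routed advantage is a weighted sum of per-horizon advantages.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(x, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def inverse_variance_weights(variances, temperature=1.0):
    """Gradient-free routing (assumed form): softmax over 1/variance."""
    precision = 1.0 / (np.asarray(variances, dtype=float) + 1e-8)
    return softmax(precision / temperature)

def clipped_surrogate(weights, advantages, ratio, clip_eps=0.2):
    """PPO clipped surrogate where the advantage is a routed mix.

    advantages: shape (batch, K), one column per horizon (hypothetical layout).
    weights:    shape (K,), routing weights that sum to 1.
    ratio:      shape (batch,), pi_new / pi_old.
    """
    mixed = advantages @ weights
    unclipped = ratio * mixed
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * mixed
    return np.minimum(unclipped, clipped).mean()

# Pathology 2: short horizons have lower return variance, so the softmax
# over inverse variances puts almost all weight on the shortest horizon.
hypothetical_variances = [0.1, 0.5, 2.0]  # short, medium, long (made up)
w_invvar = inverse_variance_weights(hypothetical_variances)

# Pathology 1: with the policy frozen (ratio == 1), the surrogate can still
# be raised just by moving the routing weights toward the most flattering
# advantage column. No change in behavior is needed.
toy_advantages = np.array([[0.5, -0.1, -0.4],
                           [0.3,  0.2, -0.5]])
frozen_ratio = np.ones(2)
uniform_score = clipped_surrogate(np.ones(3) / 3, toy_advantages, frozen_ratio)
hacked_score = clipped_surrogate(np.array([1.0, 0.0, 0.0]), toy_advantages, frozen_ratio)
```

In this toy setup `w_invvar` is close to one-hot on the shortest horizon, and `hacked_score` exceeds `uniform_score` even though the policy ratio never moved. That is the sense in which the routing weights, rather than the controller, end up absorbing the optimization pressure.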
Solution: We propose “Target Decoupling”, isolating the Actor to the purest long-term advantage while maintaining multi-timescale predictions purely for the Critic’s auxiliary representation.
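A rough sketch of how Target Decoupling could be wired up, under assumptions that are mine and not confirmed by the post: each timescale is a separate discount factor with its own critic head, advantages come from GAE, and "long-term" means the head with the largest discount. The helper names `gae` and `decoupled_targets` are hypothetical.

```python
import numpy as np

def gae(rewards, values, gamma, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    values has len(rewards) + 1 entries; the last one is the bootstrap value.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def decoupled_targets(rewards, values_per_head, gammas, lam=0.95):
    """Return (actor_advantage, critic_targets).

    values_per_head: shape (K, T + 1), one row per critic head (assumed layout).
    The actor sees only the advantage of the longest horizon; every head
    still gets its own regression target, so the multi-timescale signal
    shapes the critic's representation without routing into the actor.
    """
    values_per_head = np.asarray(values_per_head, dtype=float)
    advs = np.stack([gae(rewards, v, g, lam)
                     for v, g in zip(values_per_head, gammas)])
    critic_targets = advs + values_per_head[:, :-1]
    actor_advantage = advs[int(np.argmax(gammas))]
    return actor_advantage, critic_targets

# Tiny usage example with made-up numbers.
rewards = np.array([1.0, 1.0, 1.0])
values = np.zeros((2, 4))        # two heads, untrained critic
gammas = [0.9, 0.99]
actor_adv, critic_tgts = decoupled_targets(rewards, values, gammas, lam=1.0)
```

The point of the design is that no learnable or variance-driven weight sits between the advantage and the policy loss, so there is nothing for the surrogate to exploit and nothing that can drift toward the short horizon.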
Code: I have prepared a strict Minimal Reproducible Example (MRE): four clean, standalone Python scripts (standard MLPs only) that reproduce both collapses and the final solution on LunarLander-v2.
Please check this link:
https://zenodo.org/records/19497907
(The GitHub repo is still being prepared.)
If your expertise aligns and you find this diagnosis interesting, I would be incredibly grateful for an endorsement. Please leave a comment or DM me if you can help. Thank you!
submitted by /u/dlwlrma_22