Why do significant improvements to my critic not improve my self-play agents?

I've been working on a tricky zero-sum multi-agent RL problem for a while, implementing a PFSP (prioritized fictitious self-play) callback that has greatly improved my results so far. Since PFSP stochastically selects a past checkpoint for the learning policy to play against, I figured that informing the critic of the opponent's identity, and letting it learn an embedding for each past policy that influences its value predictions, would improve performance for the same reason that MAPPO outperforms pure PPO in multi-agent settings: more stable advantage estimates.
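(For anyone unfamiliar with PFSP, here's a minimal sketch of the sampling idea. The (1 - winrate)^p "hard opponents" weighting is the one from the AlphaStar paper; the names and exponent here are illustrative, not my exact callback.)

```python
import numpy as np

def pfsp_weights(win_rates, p=2.0):
    """win_rates[i] = learner's empirical win rate vs. past checkpoint i.
    Harder opponents (lower win rate) get more sampling weight."""
    w = (1.0 - np.asarray(win_rates, dtype=np.float64)) ** p + 1e-8  # epsilon avoids all-zero weights
    return w / w.sum()

def sample_opponent(rng, checkpoints, win_rates):
    """Stochastically pick one past checkpoint for the learner to face."""
    return checkpoints[rng.choice(len(checkpoints), p=pfsp_weights(win_rates))]

# e.g. sample_opponent(np.random.default_rng(0), ckpts, observed_win_rates)
```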

Instead, unfortunately, I've seen worse results in my initial testing. Value-function loss is the same or higher, explained variance of the state value is the same or lower (see attached image), and the agents produced by this training run have substantially worse Bradley-Terry ratings than agents from an equivalent run without the modification. I'm rather surprised by this; it seems like it shouldn't have turned out this way.
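(By explained variance I mean the standard metric, as computed in e.g. Stable-Baselines3; sketch below for clarity.)

```python
import numpy as np

def explained_variance(values, returns):
    """EV = 1 - Var(returns - values) / Var(returns).
    1.0 = perfect value predictions; <= 0 = no better than predicting the mean."""
    var_returns = np.var(returns)
    return float("nan") if var_returns == 0 else 1.0 - np.var(returns - values) / var_returns
```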

It's possible that this is just an artifact of randomness, and that the run with the modified critic happened to settle into an unlucky local minimum. Still, I would expect that letting the critic know which opponent the learning agent is facing would substantially improve learning performance, given that the opponent policy is perhaps the single most important factor determining the odds of victory. A critic that is blind to opponent identity should, in expectation, produce far noisier advantage estimates (and hence gradients) than one that isn't.

Possible explanations that I've ruled out, at least partially:

- I'm currently using gamma=0.999 and lambda=0.8. The former would certainly mitigate a better critic's value-add, but the latter should cancel that out (a low lambda leans heavily on the critic when bootstrapping GAE), so I'm fairly convinced that hyperparameters aren't the problem.
- I've manually tested each of the critic embeddings, and they do result in substantially better value predictions than predictions made with randomly-selected counterfactual embeddings. In particular, the critic consistently (and correctly) rates the same environment state as less promising when facing a stronger opponent, so I don't think the implementation is broken.
- The initial high loss and low EV in the experimental run are explained: the agent is initialized from a model pretrained in a single-agent environment, so the significance of opponent identity is something it has to learn from scratch. The identity is currently just a new learned embedding vector fed into a transformer alongside the embeddings of each environment object (rough sketch below). Should I be doing something differently there?
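For reference, here's a minimal PyTorch sketch of what I mean by that last bullet. The class name, dimensions, and the choice to read the value off the opponent token are illustrative simplifications, not my actual code:

```python
import torch
import torch.nn as nn

class OpponentAwareCritic(nn.Module):
    """Identity-aware critic: one learned embedding per past checkpoint,
    prepended as an extra token to the per-object embeddings."""
    def __init__(self, num_opponents, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.opponent_emb = nn.Embedding(num_opponents, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, object_tokens, opponent_id):
        # object_tokens: (batch, num_objects, d_model); opponent_id: (batch,)
        opp_token = self.opponent_emb(opponent_id).unsqueeze(1)  # (batch, 1, d_model)
        tokens = torch.cat([opp_token, object_tokens], dim=1)
        h = self.encoder(tokens)
        # Read the value prediction off the opponent token's output position.
        return self.value_head(h[:, 0]).squeeze(-1)  # (batch,)
```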

Does anyone have thoughts on how I could better approach this, or what I might be missing? Link to my implementation of an identity-aware critic encoder, in case it’s of use to anyone reading.

submitted by /u/EngineersAreYourPals