The Play’s the Thing
Adding latent “play calls” to a self-play policy (DIAYN-inspired)

So far I’ve been training a standard policy π(a | s) via self-play in a multi-agent basketball environment (BasketWorld). The extension I’m experimenting with is conditioning the policy on a latent variable, π(a | s, z), where z is a discrete latent “play” that persists for multiple time steps and modulates the action distribution. Intuitively, this turns the policy from purely reactive into something closer to executing temporally extended strategies. The idea is heavily inspired by DIAYN (Eysenbach et al., 2018), which learns skill-conditioned policies by rewarding the agent for reaching states from which a discriminator can identify which skill is active.
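To make the conditioning concrete, here’s a minimal PyTorch sketch of the two pieces involved: a play-conditioned policy π(a | s, z) and a DIAYN-style discriminator q(z | s). The names, layer sizes, and interfaces here are illustrative assumptions, not the actual BasketWorld code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class PlayConditionedPolicy(nn.Module):
    """pi(a | s, z): an ordinary MLP policy that also sees a one-hot 'play' id z."""

    def __init__(self, obs_dim: int, n_actions: int, n_plays: int, hidden: int = 128):
        super().__init__()
        self.n_plays = n_plays
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_plays, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor, z: torch.Tensor) -> Categorical:
        # z is an integer (long) play id; encode it one-hot and concatenate with the observation.
        z_onehot = F.one_hot(z, num_classes=self.n_plays).float()
        return Categorical(logits=self.net(torch.cat([obs, z_onehot], dim=-1)))


class PlayDiscriminator(nn.Module):
    """q(z | s): DIAYN-style discriminator that tries to infer the active play from the state."""

    def __init__(self, obs_dim: int, n_plays: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_plays),
        )

    def intrinsic_reward(self, obs: torch.Tensor, z: torch.Tensor, log_p_z: float) -> torch.Tensor:
        # DIAYN intrinsic reward: log q(z | s) - log p(z), with p(z) typically uniform over plays.
        log_q = F.log_softmax(self.net(obs), dim=-1)
        return log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1) - log_p_z
```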
In my setup:
So overall this becomes a hierarchical policy: a high-level “play caller” that picks z (held fixed for a window of steps), and the low-level π(a | s, z) that acts at every step. A rough sketch of that loop is below.
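Here’s roughly what that rollout loop looks like with the play resampled every fixed number of steps (again a sketch under assumed names and a classic gym-style env interface, not the real implementation):

```python
import torch


def rollout(env, play_caller, policy, play_length: int = 20, max_steps: int = 500):
    """One episode with a high-level play caller and a low-level play-conditioned policy.

    `play_caller(obs)` and `policy(obs, z)` are assumed to return torch Categorical
    distributions; `env` is assumed to expose the classic reset()/step() interface.
    """
    obs = env.reset()
    z = None
    for t in range(max_steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        if t % play_length == 0:
            # High level: call a new play every `play_length` steps
            # (could also be event-triggered, e.g. on a change of possession).
            z = play_caller(obs_t).sample()
        # Low level: act every step, conditioned on the current play.
        a = policy(obs_t, z).sample()
        obs, reward, done, info = env.step(a.item())
        # (store (obs, z, a, reward) here for the learner)
        if done:
            break
```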
Curious if others have tried similar latent-skill + self-play setups in multi-agent environments, especially where coordination matters. Also interested in thoughts on:
Happy to share more details if anyone’s working on similar stuff.