The Play’s the Thing

Adding latent “play calls” to a self-play policy (DIAYN-inspired)

So far I’ve been training a standard policy π(a | s) via self-play in a multi-agent basketball environment (BasketWorld).

The extension I’m experimenting with is conditioning on a latent variable:

π(a | s, z)

where z is a discrete latent “play” that persists for multiple time steps and modulates the action distribution. Intuitively, this turns the policy from a purely reactive controller into something closer to one that executes temporally extended strategies.
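To make the conditioning concrete, here’s a minimal sketch of π(a | s, z): one-hot encode z and concatenate it with the observation so a single network head covers all plays. All the sizes (N_SKILLS, OBS_DIM, N_ACTIONS) and the linear “network” are placeholder assumptions, not the actual BasketWorld setup.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SKILLS = 8    # number of discrete plays z (assumed for illustration)
OBS_DIM = 16    # per-agent observation size (assumed)
N_ACTIONS = 5   # discrete action space (assumed)

# Stand-in for a policy network: a single linear layer over [obs; one-hot(z)].
W = rng.normal(scale=0.1, size=(OBS_DIM + N_SKILLS, N_ACTIONS))

def policy_logits(obs, z):
    # Concatenating a one-hot z with the state lets one set of weights
    # represent pi(a | s, z) for every play.
    z_onehot = np.eye(N_SKILLS)[z]
    x = np.concatenate([obs, z_onehot])
    return x @ W

def sample_action(obs, z):
    logits = policy_logits(obs, z)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(N_ACTIONS, p=probs))

obs = rng.normal(size=OBS_DIM)
a = sample_action(obs, z=3)
```

In practice the linear layer would be a shared trunk with z injected via concatenation or FiLM-style modulation, but the interface is the same.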

This is heavily inspired by DIAYN (Eysenbach et al., 2018):

  • Pretrain a set of diverse latent-conditioned behaviors (skills) without task reward
  • Use a discriminator to encourage distinguishable behaviors
  • Then reuse these skills to accelerate downstream RL
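The discriminator step above amounts to an intrinsic reward of log q(z | s) − log p(z), which is maximized when states visited under z are easy to classify back to z. A hedged sketch of that reward term, assuming a uniform skill prior and taking raw discriminator logits as input (N_SKILLS is an illustrative constant):

```python
import numpy as np

N_SKILLS = 8  # assumed number of discrete plays

def diayn_reward(disc_logits, z):
    """DIAYN-style intrinsic reward: log q(z | s) - log p(z).

    disc_logits: discriminator logits over skills for the current state.
    z: index of the skill that was actually active.
    p(z) is assumed uniform over N_SKILLS, as in the paper.
    """
    # Numerically stable log-softmax to get log q(z | s).
    shifted = disc_logits - disc_logits.max()
    log_q = shifted - np.log(np.exp(shifted).sum())
    return log_q[z] - np.log(1.0 / N_SKILLS)
```

A useful sanity check: when the discriminator is at chance (uniform logits), the reward is exactly zero; it only goes positive once skills become distinguishable.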

In my setup:

  • A “skill” ≈ a multi-agent play (coordinated trajectories)
  • I learn a latent-conditioned policy π(a | s, z)
  • Then add a high-level “coach” policy π(z | s) to select plays
  • Also experimenting with fixed starting formations to inject structure

So overall this becomes a hierarchical policy:

  • High level: select z (play)
  • Low level: execute via π(a | s, z)
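The two-level loop above can be sketched as a rollout where the coach resamples z every K steps and the low-level policy conditions on it in between. Both policies here are uniform placeholders, and fixing z for K steps is just one of the two options from the questions below (the alternative being a learned termination condition):

```python
import numpy as np

rng = np.random.default_rng(1)

N_SKILLS = 8   # assumed number of plays
N_ACTIONS = 5  # assumed action space
K = 10         # hold each play for K steps (fixed-duration option, assumed)

def coach_select(obs):
    # Placeholder high-level policy pi(z | s): uniform over plays.
    return int(rng.integers(N_SKILLS))

def low_level_step(obs, z):
    # Placeholder for the latent-conditioned policy pi(a | s, z).
    return int(rng.integers(N_ACTIONS))

def rollout(T=50):
    obs = np.zeros(4)  # dummy state; a real env would update this
    z = coach_select(obs)
    trace = []
    for t in range(T):
        if t % K == 0:
            z = coach_select(obs)  # coach re-calls the play every K steps
        a = low_level_step(obs, z)
        trace.append((t, z, a))
    return trace

trace = rollout()
```

With a learned termination head instead, the `t % K == 0` check would be replaced by sampling a terminate/continue bit from the low-level policy, option-critic style.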

Curious if others have tried similar latent-skill + self-play setups in multi-agent environments, especially where coordination matters. Also interested in thoughts on:

  • stability of z usage over time
  • whether to fix z for K steps vs learn termination
  • interactions with PPO-style updates in self-play

Happy to share more details if anyone’s working on similar stuff.

submitted by /u/thecity2
