Solving DeepMind's Alchemy meta-RL benchmark with epiplexity

🧪 I was finally able to solve DeepMind’s Alchemy meta-RL benchmark using a new theoretical framework: epiplexity

For many years, I’ve been working on DeepMind’s Alchemy meta-reinforcement learning benchmark as a side project – a notoriously difficult task that requires agents to discover hidden “chemistry rules” that get shuffled each episode.

The breakthrough: Instead of only selecting models by reward, I select by epiplexity – a measure of structural information extraction from the recent paper “From Entropy to Epiplexity” (Finzi et al., 2026).

The key insight: Reward tells you what the agent achieved. Epiplexity tells you how much the agent learned.
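
To make “how much the agent learned” concrete, here is one illustrative proxy – my stand-in, not the paper’s formal definition: score an episode by how much better the post-episode model predicts that episode’s transitions than the pre-episode snapshot did. The `predict_next` head is assumed for illustration and is not the linked notebook’s API.

```python
# Illustrative proxy only - NOT the formal definition from Finzi et al.
# Idea: structure learned ~ how much better the updated model predicts the
# episode's transitions than the pre-episode snapshot did.
import torch
import torch.nn.functional as F

def epiplexity_proxy(model_before, model_after, transitions):
    """Score how much predictive structure was extracted during one episode.

    transitions: held-out (state, action, next_state) tensors from the episode.
    Assumes the model exposes a `predict_next(state, action)` head - an
    assumption made for this sketch.
    """
    def prediction_loss(model):
        with torch.no_grad():
            losses = [F.mse_loss(model.predict_next(s, a), s_next)
                      for s, a, s_next in transitions]
        return torch.stack(losses).mean()

    # Positive when the post-episode model explains the episode better.
    return (prediction_loss(model_before) - prediction_loss(model_after)).item()
```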

It’s a simple idea. Here’s how it works:

– Clone the current model into variants A (low exploration) and B (high exploration)

– Run both through the same episode

– Keep whichever learned more structure (higher epiplexity)

– Repeat (a sketch of this loop is below)
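
Here is a minimal sketch of that selection loop, assuming the proxy above. The helpers `make_variant`, `run_episode`, and `epiplexity` are placeholders passed in as callables; they stand in for the notebook’s actual A2C training and scoring code, which I’m not reproducing here.

```python
import copy

def select_by_epiplexity(model, env, make_variant, run_episode, epiplexity, n_episodes):
    """Evolutionary selection by epiplexity (sketch of the loop above).

    make_variant(model, exploration): trainable clone with low/high exploration
        (e.g. a smaller/larger entropy-bonus weight in the A2C loss).
    run_episode(variant, env): plays one episode, updates `variant` online,
        and returns the episode's transitions.
    epiplexity(before, after, transitions): scores structure learned, e.g. the
        proxy sketched earlier in the post.
    """
    for _ in range(n_episodes):
        snapshot = copy.deepcopy(model)  # pre-episode reference for scoring

        # Two clones of the same weights, differing only in exploration pressure.
        variant_a = make_variant(model, exploration="low")
        variant_b = make_variant(model, exploration="high")

        # Give each clone an identical copy of the environment so both face the
        # same episode, i.e. the same hidden chemistry (seeding works too).
        transitions_a = run_episode(variant_a, copy.deepcopy(env))
        transitions_b = run_episode(variant_b, copy.deepcopy(env))

        # Keep whichever clone learned more structure, not whichever got more reward.
        score_a = epiplexity(snapshot, variant_a, transitions_a)
        score_b = epiplexity(snapshot, variant_b, transitions_b)
        model = variant_a if score_a >= score_b else variant_b
    return model
```

The important design choice is that both clones are scored against the same pre-episode snapshot, so selection rewards within-episode learning rather than accumulated reward.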

Scores above 160 appear after around 700 episodes, and after ~1500 episodes the agent reaches ~200 reward per episode ✅ This is achieved with no modification of the action or state space, fully online via A2C.
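
For readers unfamiliar with A2C, here is a minimal one-step objective with an entropy bonus. Using the entropy coefficient `beta` as the low/high-exploration knob for variants A and B is my assumption for this sketch, not something confirmed by the notebook.

```python
import torch

def a2c_loss(log_prob, value, ret, entropy, beta, value_coef=0.5):
    """One-step A2C objective with an entropy bonus.

    log_prob: log pi(a|s) for the taken action
    value:    critic estimate V(s)
    ret:      bootstrapped return target, r + gamma * V(s') (treated as fixed)
    entropy:  policy entropy at s; larger `beta` means stronger exploration
    """
    advantage = (ret - value).detach()            # no critic gradient through the actor term
    policy_loss = -log_prob * advantage           # actor: push up advantaged actions
    value_loss = value_coef * (ret - value) ** 2  # critic: regress V(s) toward the target
    entropy_bonus = -beta * entropy               # minimizing this raises policy entropy
    return policy_loss + value_loss + entropy_bonus

# Example with a single fake transition, just to show the shapes involved.
loss = a2c_loss(
    log_prob=torch.tensor(-1.2, requires_grad=True),
    value=torch.tensor(0.4, requires_grad=True),
    ret=torch.tensor(0.9),
    entropy=torch.tensor(1.5, requires_grad=True),
    beta=0.05,  # the "high exploration" clone would presumably use a larger beta
)
loss.backward()
```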

This selection scheme creates evolutionary pressure toward models that extract transferable knowledge rather than overfitting to episode-specific noise.

📄 Paper that inspired this: arxiv.org/abs/2601.03220

The code: https://github.com/RandMan444/epiplexity-alchemy/blob/main/A2C_EPN_Epiplexity_Public.ipynb

submitted by /u/Ok_Introduction9109
