Reproduced DreamerV4 from scratch (PyTorch); offline imagination-RL ≈ behavior cloning in closed-loop eval — here’s the teardown

I reimplemented DreamerV4 (Hafner et al., 2025) from scratch in PyTorch and ran it end-to-end, fully offline, on dm_control ball_in_cup_catch — then evaluated it closed-loop in the real environment. Sharing the setup and an honest negative result, because the “why” is more useful than another “it works” post.

The pipeline (all from scratch)

  • Masked-autoencoder tokenizer (96:1 compression, MSE + 0.2·LPIPS)
  • 12-layer block-causal transformer, flow-matching dynamics + bootstrap-loss curriculum
  • Agent tokens + multi-token-prediction reward/continue/policy heads
  • PMPO (preference-based MPO) imagination RL inside the frozen world model
  • A categorical policy head (per-dim discretized; a multimodal alternative to the paper’s diagonal Gaussian)
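
For readers unfamiliar with the term, here is a minimal, framework-free sketch of a generic block-causal attention mask: tokens attend freely within their own frame and to all earlier frames, never to later ones. The function name and the `num_frames` / `tokens_per_frame` parameters are illustrative assumptions, and this is not necessarily the exact masking used in this repo or in the paper (agent tokens, for instance, would need extra rules).

```python
def block_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Hypothetical helper: full attention inside a frame, causal across frames.
    """
    n = num_frames * tokens_per_frame
    # Frame index of every token in the flattened sequence.
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


# Toy sizes chosen only for readability.
mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In PyTorch, a boolean mask with this convention (True means "may attend") matches what `torch.nn.functional.scaled_dot_product_attention` expects for a boolean `attn_mask`.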

The eval

Closed-loop in the real dm_control env, n=50 seeds — not inside imagination, where the world model grades its own student. Three policies share one world model; only the policy head differs.

Catch rate (stochastic deployment):

  • random: 0.10
  • behavior cloning: 0.32
  • imagination-RL (PMPO): 0.38

Finding 1: imagination-RL ≈ BC

Paired sign test on the same 50 seeds, PMPO (0.38) vs. BC (0.32): p = 0.63, not significant. Offline RL inside the world model adds nothing measurable over plain behavior cloning here.
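
A minimal sketch of an exact two-sided paired sign test, assuming each seed yields one outcome per policy (for example 1 = catch, 0 = miss) and that tied seeds are dropped. The function name and the toy outcome lists are made up for illustration; they are not the per-seed results behind the p-value above.

```python
from math import comb


def paired_sign_test(a, b):
    """Exact two-sided sign test on paired outcomes; tied pairs are dropped.

    Returns (p_value, wins_a, wins_b).
    """
    wins_a = sum(x > y for x, y in zip(a, b))
    wins_b = sum(x < y for x, y in zip(a, b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0, wins_a, wins_b  # every pair tied: no evidence either way
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail), wins_a, wins_b


# Synthetic per-seed outcomes, NOT the real eval data.
rl_outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
bc_outcomes = [0, 0, 1, 0, 1, 0, 1, 0]
p_value, rl_wins, bc_wins = paired_sign_test(rl_outcomes, bc_outcomes)
print(p_value, rl_wins, bc_wins)
```

Only the seeds where the two policies disagree carry information, so with binary catch outcomes the effective sample size can be much smaller than 50.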

Why not 0.96? (it’s offline)

Online DreamerV3 hits ~0.96 with millions of self-collected env steps. My buffer is fixed and of mixed quality (Hansen demos: 39% expert, 26% poor), and the demos themselves hold the ball only ~57% of the time, so the offline ceiling is ~0.57, not 0.96. You can’t clone past your data. The policy reaches ~0.25 normalized return, about 43% of that ceiling; the rest is covariate shift.

Finding 2: the bottleneck is OOD state-coverage, not the policy head

The belief state is healthy in-distribution (its action mean ≈ the demos’) and collapses only on out-of-distribution (OOD) states the demos never covered. I tested the obvious offline fixes:

  • Advantage-weighted BC: corr(return-to-go, action-decisiveness) ≈ 0 — the expert is “always-on,” so there’s nothing to up-weight.
  • Deterministic readout (categorical head, bins in [-1,1], so no clipping artifact): mean ≈ argmax (0.17), both far below sampling (0.47). Deterministic deployment is off-distribution — the actor was trained on sampled actions (PMPO optimizes the sampled policy), so sampling is the training-consistent readout.
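
The three readouts compared in the second bullet can be sketched for one action dimension as follows. This is a hedged, framework-free illustration: the helper names, the uniform bin centers, and the toy logits are assumptions, not the repo's code. Because every bin center lies strictly inside [-1, 1], none of the readouts needs clipping.

```python
import math
import random


def bin_centers(num_bins, low=-1.0, high=1.0):
    """Centers of num_bins equal-width bins spanning [low, high] (assumed layout)."""
    width = (high - low) / num_bins
    return [low + (i + 0.5) * width for i in range(num_bins)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def readouts(logits, rng):
    """Return (mean, argmax, sample) actions for one discretized action dim."""
    centers = bin_centers(len(logits))
    probs = softmax(logits)
    mean = sum(p * c for p, c in zip(probs, centers))
    argmax = centers[max(range(len(probs)), key=probs.__getitem__)]
    sample = rng.choices(centers, weights=probs, k=1)[0]
    return mean, argmax, sample


# Toy bimodal logits (hypothetical): mass on both extremes, little in the middle.
mean_a, argmax_a, sample_a = readouts([3.0, 0.0, 0.0, 0.0, 3.0], random.Random(0))
print(mean_a, argmax_a, sample_a)
```

On a strongly bimodal head like this toy one, the mean lands near 0 while the argmax sits on a mode, so the two readouts diverge. Seeing mean ≈ argmax in practice is what argues against strong mode-averaging.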

Neither moved the number. The conclusion I land on: closing the gap is structurally an online-RL / DAgger problem — offline can’t add the missing coverage.
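
The advantage-weighted BC diagnostic from the list above can be sketched as below. The post does not define "action-decisiveness", so the `decisiveness` helper (largest absolute action component) is purely an assumption for illustration, as are the toy rewards and actions.

```python
import math


def return_to_go(rewards, gamma=1.0):
    """Discounted sum of future rewards at every step of one episode."""
    rtg, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def decisiveness(action):
    """Assumed proxy: largest absolute component of the action vector."""
    return max(abs(a) for a in action)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either signal is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant signal gives a weighting scheme nothing to rank
    return cov / math.sqrt(vx * vy)


# Toy episode (hypothetical): an "always-on" demonstrator saturates every step.
rewards = [0.0, 0.0, 1.0, 1.0]
actions = [(1.0, -1.0), (-1.0, 1.0), (1.0, 1.0), (-1.0, -1.0)]
corr = pearson(return_to_go(rewards), [decisiveness(a) for a in actions])
print(corr)
```

When decisiveness is the same at every step, as in this toy episode, return-based weights cannot separate good steps from bad ones, which is one way to read the "nothing to up-weight" observation.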

Code + weights

MIT, with passing unit tests for the imagination algebra and the world-model attention firewall, and a 2-command repro of the eval:

Happy to answer questions or hear where I’m wrong — particularly on the OOD-vs-mode-averaging call: mean ≈ argmax rules out strong mode-averaging, but I haven’t fully isolated mild conditional multimodality (an earlier kNN probe found ~37% mildly-multimodal neighborhoods). Next step is taking the pipeline online.

submitted by /u/vijayabhaskarev