Reproduced DreamerV4 from scratch (PyTorch); offline imagination-RL ≈ behavior cloning in closed-loop eval — here’s the teardown
I reimplemented DreamerV4 (Hafner et al., 2025) from scratch in PyTorch and ran it end-to-end, fully offline, on dm_control ball_in_cup_catch — then evaluated it closed-loop in the real environment. Sharing the setup and an honest negative result, because the “why” is more useful than another “it works” post.
The pipeline (all from scratch)
- Masked-autoencoder tokenizer (96:1 compression, MSE + 0.2·LPIPS)
- 12-layer block-causal transformer, flow-matching dynamics + bootstrap-loss curriculum
- Agent tokens + multi-token-prediction reward/continue/policy heads
- PMPO (preference-based MPO) imagination RL inside the frozen world model
- A categorical policy head (per-dim discretized; a multimodal alternative to the paper’s diagonal Gaussian)
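For readers unfamiliar with the term, here is a minimal sketch of what a block-causal attention mask looks like. It is a plain-Python illustration and not the repo's code: it assumes the simplest form, where every token in a frame can attend to all tokens in the same frame and in earlier frames. The names `block_causal_mask`, `num_frames`, and `tokens_per_frame` are hypothetical, and the actual model may add further restrictions (for example around agent tokens).

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumption: tokens are laid out frame by frame, so token index // tokens_per_frame
    gives the frame (time step) the token belongs to.
    """
    n = num_frames * tokens_per_frame
    mask = []
    for q in range(n):
        q_frame = q // tokens_per_frame
        # Allowed: any key in the same frame or an earlier frame.
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(n)])
    return mask


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    # "x" = attention allowed, "." = masked out
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is lower block-triangular: full attention inside a frame, none to future frames.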
The eval
Closed-loop in the real dm_control env, n=50 seeds — not inside imagination, where the world model grades its own student. Three policies share one world model; only the policy head differs.
Catch rate (stochastic deployment):
- random: 0.10
- behavior cloning: 0.32
- imagination-RL (PMPO): 0.38
Finding 1: imagination-RL ≈ BC
Paired sign test on the same 50 seeds: p = 0.63 (not significant). The 0.06 gap is 3 episodes out of 50 (19 catches vs. 16). Offline RL inside the world model adds nothing measurable over plain behavior cloning here.
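For anyone who wants to run the same kind of check on their own per-seed outcomes, a minimal sketch of an exact two-sided paired sign test follows. It assumes binary catch/no-catch outcomes per seed and drops tied seeds. The function name and the toy data are hypothetical and are not the results from this post.

```python
from math import comb


def paired_sign_test(a, b):
    """Exact two-sided sign test on paired outcomes; tied pairs are dropped."""
    wins_a = sum(1 for x, y in zip(a, b) if x > y)
    wins_b = sum(1 for x, y in zip(a, b) if y > x)
    n = wins_a + wins_b  # number of discordant pairs
    if n == 0:
        return 1.0  # no seed distinguishes the two policies
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical): 1 = catch, 0 = miss, same seed at each index.
policy_a = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
policy_b = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]
p_value = paired_sign_test(policy_a, policy_b)
print(f"p = {p_value:.3f}")
```

Only the seeds where the two policies disagree carry information, which is why a small gap on 50 seeds rarely reaches significance.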
Why not 0.96? (it’s offline)
Online DreamerV3 hits ~0.96 with millions of self-collected env steps. My buffer is fixed and mixed-quality (Hansen demos: 39% expert, 26% poor) and itself only holds the ball ~57% of the time — so the offline ceiling is ~0.57, not 0.96. You can’t clone past your data. The policy reaches ~0.25 normalized return, about 43% of that ceiling; the rest is covariate shift.
Finding 2: the bottleneck is OOD state-coverage, not the policy head
The belief state is healthy in-distribution (its action mean ≈ the demos) and collapses only on OOD states the demos never covered. I tested the obvious offline fixes:
- Advantage-weighted BC: corr(return-to-go, action-decisiveness) ≈ 0 — the expert is “always-on,” so there’s nothing to up-weight.
- Deterministic readout (categorical head, bins in [-1,1], so no clipping artifact): mean ≈ argmax (0.17), both far below sampling (0.47). Deterministic deployment is off-distribution — the actor was trained on sampled actions (PMPO optimizes the sampled policy), so sampling is the training-consistent readout.
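To make the three readouts concrete, here is a minimal plain-Python sketch for a single action dimension of a discretized head. It assumes evenly spaced bin centers in [-1, 1]. The helper names and the toy probability vectors are hypothetical, and the real head works on tensors and uses many more bins.

```python
import random


def bin_centers(num_bins):
    # Evenly spaced centers over [-1, 1], endpoints included (an assumption).
    return [-1.0 + 2.0 * i / (num_bins - 1) for i in range(num_bins)]


def readout_mean(probs, centers):
    # Expected action under the categorical distribution.
    return sum(p * c for p, c in zip(probs, centers))


def readout_argmax(probs, centers):
    # Center of the most probable bin (first one on ties).
    return centers[max(range(len(probs)), key=probs.__getitem__)]


def readout_sample(probs, centers, rng):
    # Stochastic readout: draw one bin according to its probability.
    return rng.choices(centers, weights=probs, k=1)[0]


centers = bin_centers(5)                # [-1.0, -0.5, 0.0, 0.5, 1.0]
unimodal = [0.0, 0.1, 0.8, 0.1, 0.0]    # one clear mode at 0.0
bimodal = [0.5, 0.05, 0.0, 0.05, 0.4]   # two modes at the extremes

rng = random.Random(0)
for name, probs in [("unimodal", unimodal), ("bimodal", bimodal)]:
    mean = readout_mean(probs, centers)
    top = readout_argmax(probs, centers)
    draw = readout_sample(probs, centers, rng)
    print(f"{name}: mean={mean:+.2f} argmax={top:+.2f} sample={draw:+.2f}")
```

With the unimodal vector, mean and argmax coincide. With the bimodal one, the mean lands between the modes while argmax picks one of them, which is the signature a mean-vs-argmax comparison looks for.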
Neither moved the number. The conclusion I land on: closing the gap is structurally an online-RL / DAgger problem — offline can’t add the missing coverage.
Code + weights
MIT, with passing unit tests for the imagination algebra and the world-model attention firewall, and a 2-command repro of the eval:
- GitHub: https://github.com/vijayabhaskar-ev/dreamer_v4
- Weights (HF): https://huggingface.co/vijayabhaskarev/dreamer-v4
Happy to answer questions or hear where I’m wrong — particularly on the OOD-vs-mode-averaging call: mean ≈ argmax rules out strong mode-averaging, but I haven’t fully isolated mild conditional multimodality (an earlier kNN probe found ~37% mildly-multimodal neighborhoods). Next step is taking the pipeline online.
submitted by /u/vijayabhaskarev