DVD-JEPA: an open-source, fully-reproducible JEPA world model [P]

A paper currently trending on paperswithcode.co in the “Anomaly Detection” category is DVD-JEPA.

https://i.redd.it/r6fd8n3d4f8h1.gif

Here is the short summary:

Most attempts to learn a world model from video try to predict the next frame pixel-by-pixel, and drown in detail that is fundamentally unpredictable. JEPA (Joint-Embedding Predictive Architecture, LeCun 2022) makes a different bet: predict the representation of the future, not the pixels, and let the encoder discard whatever it cannot predict.
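One way to write that contrast down, as a sketch rather than the paper's exact objective. The notation is assumed here, not taken from the paper: f_theta is the context encoder, f_theta-bar the target encoder, g_phi the predictor, sg a stop-gradient, and x-hat a predicted frame. The squared-error distance is also an assumption.

```latex
% Pixel-space prediction: every pixel of the next frame must be reproduced.
\mathcal{L}_{\text{pixel}} = \lVert \hat{x}_{t+1} - x_{t+1} \rVert_2^2

% JEPA-style prediction: only the representation of the next frame is matched.
\mathcal{L}_{\text{JEPA}} = \lVert g_\phi\bigl(f_\theta(x_t)\bigr) - \mathrm{sg}\bigl[f_{\bar\theta}(x_{t+1})\bigr] \rVert_2^2
```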

DVD-JEPA is the smallest honest demonstration of that idea we could build. The “world” is a DVD logo bouncing in a 16×16 box. A context encoder, an EMA target encoder, and a latent predictor are trained, with no labels and no decoder, to predict the 32-dimensional representation of the next observation. We then show three things:
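Two pieces of this setup can be sketched without any training framework: the toy world itself and the EMA rule that keeps the target encoder a slow-moving copy of the context encoder. This is a minimal illustration, not the project's code. The logo size (3×5), the unit velocity, the decay tau = 0.99, and every function name are assumptions; the post only states the 16×16 box.

```python
SIZE = 16               # box side length, as stated in the post
LOGO_H, LOGO_W = 3, 5   # hypothetical logo size; the post does not give one


def step(state):
    """Advance one frame: move by the velocity and reflect off the walls."""
    y, x, vy, vx = state
    y, x = y + vy, x + vx
    max_y, max_x = SIZE - LOGO_H, SIZE - LOGO_W
    if y < 0 or y > max_y:
        y = -y if y < 0 else 2 * max_y - y
        vy = -vy
    if x < 0 or x > max_x:
        x = -x if x < 0 else 2 * max_x - x
        vx = -vx
    return (y, x, vy, vx)


def render(state):
    """Draw the logo as a block of 1s on a SIZE x SIZE grid of 0s."""
    y, x, _, _ = state
    return [[1 if y <= r < y + LOGO_H and x <= c < x + LOGO_W else 0
             for c in range(SIZE)]
            for r in range(SIZE)]


def ema_update(target, online, tau=0.99):
    """Move the target-encoder weights a small step toward the online weights.

    Weights are flat lists of floats here; a real model would apply the same
    rule to every parameter tensor.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


# Generate a short unlabeled video of the bounce.
state = (0, 0, 1, 1)
frames = []
for _ in range(40):
    frames.append(render(state))
    state = step(state)
```

Consecutive pairs (frames[t], frames[t + 1]) are the only supervision such a model would see: the context encoder embeds the first frame, the target encoder embeds the second, and the predictor maps one embedding to the other.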

  1. It learned the world. A linear probe recovers the logo’s (y, x) position from the frozen 32-d latent to within 0.73 px, even though the model was never given a coordinate.
  2. It can dream (once you add a decoder). Bolt an optional decoder onto the frozen latents and roll the predictor forward: it renders a correct future-frame video of the bounce, including wall reflections, for ~20 steps before latent drift sets in.
  3. It is useful. Run it as a 1-step predictive monitor, and the prediction error becomes an anomaly signal: inject a teleport and surprise spikes 88× over baseline, on the right frame.
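The monitoring idea in item 3 can be sketched as follows. This is an illustration under loud assumptions, not the DVD-JEPA model: the "latent" is a hand-made 4-vector (y, x, vy, vx) with small observation noise, and the "predictor" is the exact bounce rule standing in for the learned MLP. The wall positions, noise level, teleport target, and the use of the median as baseline are all hypothetical choices.

```python
import random
import statistics


def bounce(pos, vel, lo=0.0, hi=12.0):
    """Move one step along one axis and reflect off the walls at lo and hi."""
    pos += vel
    if pos < lo:
        pos, vel = 2 * lo - pos, -vel
    elif pos > hi:
        pos, vel = 2 * hi - pos, -vel
    return pos, vel


def predict(z):
    """Stand-in 1-step predictor (the real one would be a learned MLP)."""
    y, vy = bounce(z[0], z[2])
    x, vx = bounce(z[1], z[3])
    return (y, x, vy, vx)


def surprise(latents):
    """Squared prediction error per step; scores[i] belongs to frame i + 1."""
    scores = []
    for z_now, z_next in zip(latents, latents[1:]):
        z_hat = predict(z_now)
        scores.append(sum((a - b) ** 2 for a, b in zip(z_hat, z_next)))
    return scores


rng = random.Random(0)
TELEPORT_FRAME = 20
state = (0.5, 0.5, 1.0, 1.0)
latents = []
for t in range(40):
    if t == TELEPORT_FRAME:
        # Injected anomaly: the logo jumps, its velocity is unchanged.
        state = (9.5, 2.5, state[2], state[3])
    y, x, vy, vx = state
    noisy = (y + rng.uniform(-0.05, 0.05), x + rng.uniform(-0.05, 0.05), vy, vx)
    latents.append(noisy)
    state = predict(state)

scores = surprise(latents)
spike_frame = scores.index(max(scores)) + 1
ratio = max(scores) / statistics.median(scores)
print(f"largest surprise at frame {spike_frame}, {ratio:.0f}x the median")
```

Because the monitor only looks one step ahead, the error returns to baseline on the frame after the jump: the predictor starts from the teleported state and is right again. The ratio printed here depends entirely on the toy noise level and says nothing about the 88× figure reported for the trained model.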

The whole thing runs client-side in your browser — the trained MLPs are re-implemented in ~40 lines of JavaScript. It is a joke, and it is also a correct, working instance of the architecture behind I-JEPA, V-JEPA, and V-JEPA 2.

Find the paper, HF model, and project page here: https://paperswithcode.co/paper/98361

submitted by /u/NielsRogge