DVD-JEPA: an open-source, fully-reproducible JEPA world model [P]
A paper currently trending on paperswithcode.co in the “Anomaly Detection” category is DVD-JEPA.

https://i.redd.it/r6fd8n3d4f8h1.gif

Here is the short summary:

Most attempts to learn a world model from video try to predict the next frame pixel-by-pixel, and drown in detail that is fundamentally unpredictable. JEPA (Joint-Embedding Predictive Architecture, LeCun 2022) makes a different bet: predict the representation of the future, not the pixels, and let the encoder discard whatever it cannot predict.

DVD-JEPA is the smallest honest demonstration of that idea we could build. The “world” is a DVD logo bouncing in a 16×16 box. A context encoder, an EMA target encoder, and a latent predictor are trained, with no labels and no decoder, to predict the next observation in a 32-dimensional representation space. We then show three things:
The whole thing runs client-side in your browser: the trained MLPs are re-implemented in ~40 lines of JavaScript. It is a joke, and it is also a correct, working instance of the architecture behind I-JEPA, V-JEPA, and V-JEPA 2.

Find the paper, HF model, and project page here: https://paperswithcode.co/paper/98361

submitted by /u/NielsRogge
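
For anyone who wants the training setup from the summary spelled out, here is a rough sketch of one JEPA-style update on a bouncing-logo world. It is not the authors' code. Everything beyond the 16×16 box and the 32-dimensional latent is an assumption made for illustration: the logo is a single lit pixel, the velocity is ±1 per axis, the encoders and the predictor are plain linear maps rather than MLPs, the context is a single frame, and the names `step`, `pixel`, `train_step`, `LR`, and `TAU` are hypothetical.

```python
import random

GRID, LATENT = 16, 32   # box side and latent size, both taken from the post
LR, TAU = 0.05, 0.99    # learning rate and EMA decay: assumed values

def step(state):
    """Move the logo one cell, reversing velocity at a wall (assumed dynamics)."""
    x, y, vx, vy = state
    if not 0 <= x + vx < GRID:
        vx = -vx
    if not 0 <= y + vy < GRID:
        vy = -vy
    return x + vx, y + vy, vx, vy

def pixel(state):
    """Index of the single lit pixel in the flattened 16x16 frame."""
    x, y, _, _ = state
    return y * GRID + x

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def train_step(enc, tgt, pred, state):
    """One update: predict the target embedding of frame t+1 from frame t."""
    nxt = step(state)
    i, j = pixel(state), pixel(nxt)
    z = [row[i] for row in enc]        # context embedding of a one-hot frame
    target = [row[j] for row in tgt]   # target embedding, treated as a constant
    p = matvec(pred, z)                # predicted embedding of the next frame
    err = [a - b for a, b in zip(p, target)]
    loss = sum(e * e for e in err) / LATENT   # MSE in latent space, no decoder
    g_p = [2 * e / LATENT for e in err]       # gradient of the loss w.r.t. p
    # Gradient w.r.t. z, computed before the predictor weights change.
    g_z = [sum(pred[k][c] * g_p[k] for k in range(LATENT)) for c in range(LATENT)]
    for k in range(LATENT):
        for c in range(LATENT):
            pred[k][c] -= LR * g_p[k] * z[c]
        enc[k][i] -= LR * g_z[k]       # a one-hot input only touches column i
    # EMA update: the target encoder slowly follows the context encoder.
    for k in range(LATENT):
        for c in range(GRID * GRID):
            tgt[k][c] = TAU * tgt[k][c] + (1 - TAU) * enc[k][c]
    return loss, nxt

rng = random.Random(0)
enc = rand_matrix(LATENT, GRID * GRID, rng)
tgt = [row[:] for row in enc]          # target starts as a copy of the encoder
pred = rand_matrix(LATENT, LATENT, rng)
state = (3, 7, 1, -1)
for _ in range(200):
    loss, state = train_step(enc, tgt, pred, state)
```

The part that makes this JEPA rather than next-frame prediction is that the error is measured between two embeddings, and the target encoder receives no gradient; it only moves through the EMA line. A single frame does not reveal the logo's velocity, so a context of two or more frames would be the natural next change to this sketch.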