From Seeing to Knowing the World: A Survey of Vision World Models
Acquiring world knowledge directly from visual observation is fundamental to Artificial General Intelligence (AGI). To support this capability, the Vision World Model (VWM) has emerged as a key paradigm: a model that learns from visual streams how the world evolves over time. However, recent progress has been driven by diverse research communities, resulting in inconsistent problem formulations, disconnected taxonomies, and divergent evaluation protocols. We argue that addressing this gap requires a conceptual shift: vision should not be treated merely as an input modality, but as the primary driver shaping how world models are represented, learned, and evaluated. Guided by this vision-centric perspective, we introduce a unified framework that organizes VWM research into three core components (vision encoding, knowledge learning, and controllable simulation) and use it to analyze existing model designs and evaluation methodologies. Finally, we outline future research directions that emphasize stronger physical and causal grounding, more meaningful evaluation beyond visual appearance, and scaling toward more general and reliable world modeling capabilities.