WoW-world-model
WoW 1 Wan 14B 600k
WoW-1-Wan-14B is a 14-billion-parameter generative world model trained on 2 million real-world robot interaction trajectories. It is designed to imagine, reason, and act in physically consistent environments, powered by SOPHIA-guided refinement and a co-trained Inverse Dynamics Model.

This model is part of the WoW (World-Omniscient World Model) project, introduced in the paper:

> WoW: Towards a World omniscient World model Through Embodied Interaction
>
> Chi et al., 2025 – arXiv:2509.22642

Key features:

- 14B parameters trained on 2M robot interaction samples
- Learns causal physical reasoning from embodied action
- Generates physically consistent video and robotic action plans
- Uses SOPHIA, a vision-language critic, to refine outputs
- Paired with an Inverse Dynamics Model to complete the imagination-to-action loop

Training data:

- 2M real-world robot interaction trajectories
- Multimodal scenes including vision, action, and language
- Diverse mixture captions for better generalization

🧠 Mixture Caption Strategy

- Prompt Lengths:
  - Short: "The Franka robot, grasp the red bottle on the table"
  - Long: "The scene... open the drawer, take the screwdriver, place it on the table..."
- Robot Model Mixing:
  - Captions reference various robot types
  - Example: "grasp with the Franka Panda arm", "use end-effector to align"
- Action Granularity:
  - Coarse: "move to object"
  - Fine: "rotate wrist 30° before grasping"

This dataset will be continuously updated with:

- More trajectories
- Richer language
- Finer multimodal annotations

Intended applications:

- Zero-shot video generation in robotics
- Causal reasoning and physics simulation
- Long-horizon manipulation planning
- Forward and inverse control prediction

Links:

- 🧠 Project page: wow-world-model.github.io
- 💻 GitHub repo: wow-world-model/wow-world-model
- 📊 Dataset: WoW-1 Benchmark Samples
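
To make the mixture caption strategy described in this card more concrete, here is a minimal sketch of how a caption might be sampled along its three axes (prompt length, robot naming, action granularity). This is not the WoW training pipeline: `CAPTION_VARIANTS`, `ROBOT_NAMES`, and `sample_caption` are hypothetical names, the caption strings are illustrative variations on the examples in this card, and uniform sampling is an assumption.

```python
import random

# Hypothetical annotation for ONE trajectory: several captions describing the
# same clip, keyed by (prompt length, action granularity). The strings are
# illustrative, loosely based on the examples in this model card.
CAPTION_VARIANTS = {
    ("short", "coarse"): "The {robot}, move to the red bottle on the table",
    ("short", "fine"): "The {robot}, rotate wrist 30° before grasping the red bottle",
    ("long", "coarse"): (
        "The scene shows a table with a red bottle. "
        "The {robot} moves to the bottle and grasps it."
    ),
    ("long", "fine"): (
        "The scene shows a table with a red bottle. "
        "The {robot} moves above the bottle, rotates its wrist 30°, and grasps it."
    ),
}

# Hypothetical pool of robot references used for "robot model mixing".
ROBOT_NAMES = ["Franka robot", "Franka Panda arm", "robot arm"]


def sample_caption(rng: random.Random) -> dict:
    """Pick one caption variant uniformly along each axis (an assumption)."""
    length = rng.choice(["short", "long"])
    granularity = rng.choice(["coarse", "fine"])
    robot = rng.choice(ROBOT_NAMES)
    text = CAPTION_VARIANTS[(length, granularity)].format(robot=robot)
    return {
        "length": length,
        "granularity": granularity,
        "robot": robot,
        "text": text,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same samples
    for _ in range(3):
        print(sample_caption(rng)["text"])
```

Drawing a fresh caption each time a trajectory is visited means the same video is paired with differently worded prompts across epochs, which is one plausible way the caption diversity described here could be realized.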