DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

Ziyu Shan, Zhenyu Wu, Xiaofeng Wang, Zheng Zhu, Ziwei Wang

Video-based embodied world models provide an appealing substrate for robotic manipulation by predicting future states, yet current approaches remain limited by a fundamental entanglement: accurately modeling dynamics typically requires low-level temporal reasoning, while producing high-resolution frames demands expansive visual synthesis guided by high-level semantics. This entanglement results in either inference that is too slow for iterative planning or predictions that are too coarse to retain contact-rich details. To resolve this dilemma, we present the Disentangled Video Generation World Model (DVG-WM), an efficient framework that explicitly decomposes world modeling into dynamics learning and visual synthesis. Conditioned on an initial observation and a language instruction, our model first generates a plausible sequence of intermediate visual states to preview the physical interaction and refines them to obtain high-fidelity videos. Furthermore, an efficient cascading mechanism is proposed, where DVG-WM uses flow matching to directly map the dynamics to video latents, and introduces a latent degradation mechanism to regenerate contact-rich details. Experiments on LIBERO and real-world platforms demonstrate improved video quality with up to 3.97 times acceleration, validating that disentangled video generation can serve as an efficient embodied world model for robotic manipulation.

Open Source

Research Brief

The DVG-WM paper introduces a method for robotic manipulation world models that disentangles dynamics prediction from visual synthesis, leading to faster and higher-fidelity video generation.

Current video-based world models for robots face a trade-off between accurately predicting physical dynamics and generating high-resolution visual detail, which leads either to inference that is too slow for iterative planning or to predictions that are too coarse to retain contact-rich details. This paper addresses this 'entanglement' by proposing the Disentangled Video Generation World Model (DVG-WM), which explicitly separates the learning of physical dynamics from the synthesis of visual frames. Conditioned on an initial observation and a language instruction, it first generates a sequence of intermediate visual states that preview the interaction, then refines these into high-fidelity videos. An efficient cascading mechanism uses flow matching to map the learned dynamics directly to video latents, and a latent degradation mechanism regenerates fine, contact-rich details. The authors report improved video quality and up to a 3.97x speed-up on both simulated (LIBERO) and real-world platforms.
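The abstract does not spell out how flow matching maps dynamics to video latents, so the following is only a generic sketch of the technique, not the authors' implementation. It assumes the common straight-line (rectified-flow style) path between a source latent and a target latent; the names `dynamics_latent`, `video_latent`, and `oracle_velocity` are hypothetical placeholders, and a real system would replace the oracle with a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 to x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """For a straight-line path the regression target is the constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    tgt = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, tgt)) / len(tgt)


def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy latents: the source stands in for a dynamics latent,
# the target for a video latent. Real latents would be large tensors.
dynamics_latent = [0.0, 1.0, -1.0]
video_latent = [2.0, 0.5, 3.0]


def oracle_velocity(x, t):
    """Stand-in for a trained network: returns the exact target velocity."""
    return target_velocity(dynamics_latent, video_latent)


loss = flow_matching_loss(oracle_velocity, dynamics_latent, video_latent, 0.3)
sample = euler_sample(oracle_velocity, dynamics_latent, num_steps=4)
print(loss, sample)
```

Because the source of the flow is already an informative latent rather than pure noise, a few integration steps can suffice in this toy setting, which is one plausible reading of why such a mapping could be fast.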

Potential Applications

Advanced robotic manipulation in manufacturing, allowing robots to quickly plan and execute complex assembly or handling tasks with delicate objects.
Surgical robotics, enabling more precise and safer autonomous or teleoperated procedures by predicting tissue deformation and interaction in real-time.
Household and service robots, improving their ability to interact with diverse objects and environments efficiently and robustly.
Autonomous vehicle simulation and planning, providing more realistic and faster predictive models for complex interaction scenarios like parking or off-road driving.

Paper Trustworthiness Index

38 / 100

High Skepticism

Skeptical / Unreviewed

This document should be treated with critical skepticism. It contains unverified scientific claims or was self-published.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

10 / 25

The abstract does not provide author names or affiliations, making it impossible to assess their track record. A neutral score is assigned assuming it originates from a standard academic or industry research group, without specific prestige information.

Technical Rigor & Methodology

23 / 30

The abstract describes a specific problem, proposes a clear architectural solution (disentanglement, cascading mechanism, flow matching, latent degradation), and reports quantitative performance metrics (video quality, 3.97x acceleration) on relevant platforms (LIBERO, real-world). This suggests a well-defined and rigorously tested technical approach.

Reproducibility & Openness

0 / 25

The abstract does not mention the availability of code, datasets, or pre-trained weights. Without explicit links or statements regarding open-sourcing, reproducibility cannot be confirmed.

Community Vetting & Peer Review

5 / 20

The abstract does not indicate whether the paper has been peer-reviewed, published in a conference, or is an unreviewed preprint. Without this information, a conservative score is assigned, as its community vetting status is unknown.
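As a quick consistency check, the four pillar scores can be added up and compared with the overall index. This is a minimal sketch that assumes the index is the plain sum of the pillars; the page does not state the aggregation rule, but the numbers are consistent with it.

```python
# Pillar scores as (earned, maximum), copied from the breakdown above.
pillars = {
    "Author & Institutional Track Record": (10, 25),
    "Technical Rigor & Methodology": (23, 30),
    "Reproducibility & Openness": (0, 25),
    "Community Vetting & Peer Review": (5, 20),
}

# Assumption: the overall index is an unweighted sum of the pillar scores.
total = sum(earned for earned, _ in pillars.values())
maximum = sum(cap for _, cap in pillars.values())

print(f"{total}/{maximum}")  # 38/100
```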

Detailed Evidence Assessment

Verified Evidence & Citations

DVG-WM explicitly decomposes world modeling into dynamics learning and visual synthesis.

“an efficient framework that explicitly decomposes world modeling into dynamics learning and visual synthesis.”

The model generates a sequence of intermediate visual states and then refines them.

“our model first generates a plausible sequence of intermediate visual states to preview the physical interaction and refines them to obtain high-fidelity videos.”

An efficient cascading mechanism is proposed where DVG-WM uses flow matching to directly map dynamics to video latents.

“Furthermore, an efficient cascading mechanism is proposed, where DVG-WM uses flow matching to directly map the dynamics to video latents”

It introduces a latent degradation mechanism to regenerate contact-rich details.

“and introduces a latent degradation mechanism to regenerate contact-rich details.”

Experiments on LIBERO and real-world platforms demonstrate improved video quality.

“Experiments on LIBERO and real-world platforms demonstrate improved video quality”

Experiments show up to 3.97 times acceleration.

“with up to 3.97 times acceleration”
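To put the speed-up figure in more familiar terms, a factor of s means each generation takes 1/s of the baseline time. The sketch below is plain arithmetic on the reported best-case factor and assumes nothing about the baseline's absolute latency, which the abstract does not give.

```python
def time_saved_percent(speedup):
    """Percentage of baseline wall-clock time saved at a given speed-up factor."""
    return 100.0 * (1.0 - 1.0 / speedup)


reported_speedup = 3.97  # best-case ("up to") factor quoted in the abstract
saved = round(time_saved_percent(reported_speedup), 1)
print(saved)  # 74.8
```

Since the figure is an upper bound, typical savings across tasks may be smaller.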

Uncertainties & Omissions

• Omission: Specific baseline models against which DVG-WM's performance (video quality, acceleration) is compared.

• Omission: Details on the specific architectures of the dynamics learning and visual synthesis components.

• Omission: Information regarding ablation studies that validate the contribution of individual components like flow matching or latent degradation.

• Omission: Availability of code, datasets, or pre-trained models for replication.

• Omission: Details on the statistical significance or variability of the reported speed-up and quality improvements.

• Omission: Peer-review status or publication venue.

• Uncertainty: The precise definition and nature of 'intermediate visual states' and the 'refinement' process.

• Uncertainty: The generalizability of 'up to 3.97 times acceleration' across a wide range of tasks and complexities.

• Uncertainty: The robustness of the 'latent degradation mechanism' in reproducing fine details across various contact scenarios.

• Uncertainty: Potential limitations or failure modes of the disentangled approach, such as scenarios where dynamics and visual synthesis are inherently harder to separate.