NVLabs unveils SANA-WM for minute-long 720p video with 6-DoF camera control

The NVLabs project adds 6-DoF camera control, scene persistence, and a refiner, with a distilled NVFP4 variant that denoises a 60s clip in 34s on an RTX 5090.


Why it matters

Long-horizon, camera-controllable video that runs on a single GPU pushes world models from demo to practical tooling for simulation, robotics, and embodied AI. Open-sourcing the weights and code gives researchers and startups a reproducible baseline to test, fine-tune, and benchmark at minute scale.


Haoyi Zhu (@HaoyiZhu) announced SANA-WM in a thread on X: a 2.6B-parameter open-source world model that generates minute-long 720p video conditioned on a text prompt, a single image, and a 6-DoF camera trajectory. The system targets controllable 60-second clips on a single GPU.

Watch the original launch video on X: @HaoyiZhu's post.

"The goal is simple: make long-horizon world modeling practical," Zhu wrote on X. Instead of stitching short clips, SANA-WM is trained natively for 1-minute generation with precise camera control and strong scene persistence. Under the hood, the team highlights a hybrid GDN + softmax attention setup for long-context efficiency, a dual-branch camera-control module for trajectory following, a long-video refiner for fidelity, and the use of robust pose labels from public videos. Full details are in the project page and the paper.
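The thread describes the hybrid GDN + softmax attention design only at a high level. As a rough illustration of the general idea behind such hybrids (not the actual SANA-WM architecture; every function name, the gating scheme, and the layer schedule below are hypothetical simplifications), most layers can use a constant-memory gated linear-attention recurrence, with occasional full softmax layers interleaved for exact global mixing:

```python
import math

def softmax_attention(q, k, v):
    """Full softmax attention over the whole sequence: O(T^2) but exact.
    q, k, v are lists of T vectors (lists of floats)."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * vj[t] for wi, vj in zip(w, v)) / z
                    for t in range(d)])
    return out

def gated_delta_step(state, k, v, gate):
    """One recurrent update of a gated linear-attention state matrix:
    state <- gate * state + outer(k, v). O(1) per token, constant memory."""
    d = len(k)
    return [[gate * state[i][j] + k[i] * v[j] for j in range(d)]
            for i in range(d)]

def linear_attention(q, k, v, gate=0.9):
    """Causal linear attention via a running state: O(T) over the sequence."""
    d = len(q[0])
    state = [[0.0] * d for _ in range(d)]
    out = []
    for qi, ki, vi in zip(q, k, v):
        state = gated_delta_step(state, ki, vi, gate)
        out.append([sum(qi[i] * state[i][j] for i in range(d))
                    for j in range(d)])
    return out

def hybrid_stack(x, num_layers=6, softmax_every=3):
    """Interleave cheap linear-attention layers with occasional full
    softmax layers, so most of the stack scales linearly with length."""
    for layer in range(num_layers):
        if (layer + 1) % softmax_every == 0:
            x = softmax_attention(x, x, x)
        else:
            x = linear_attention(x, x, x)
    return x
```

The trade-off this sketch captures is why such hybrids suit long-context video: per-token cost stays near-linear in sequence length, while the sparse softmax layers retain exact long-range access.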

Efficiency numbers are the point of the release. Zhu says SANA-WM was trained on roughly 213K public video clips in 15 days on 64 H100s, and generates each 60s 720p clip on a single GPU. A distilled NVFP4 variant denoises a 60s clip in 34s on an RTX 5090. On the team's one-minute world-model benchmark, they report stronger action-following than prior open-source baselines, comparable visual quality, and up to 36x higher throughput. Code is available at NVLabs/Sana.

