NVLabs unveils SANA-WM, a 2.6B world model that makes 60s 720p video on one GPU

The NVIDIA research team says the open-source code can turn a single image plus a camera path into a minute-long, controllable clip; weights are listed as coming soon.

Why it matters

Long, controllable video generation has been computationally expensive. If NVLabs' minute-long world model runs on a single GPU with the quality and repeatability the paper reports, startups can prototype cinematic, robotics, and simulation workflows without a cluster.

Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, and colleagues at NVIDIA's NVLabs today introduced SANA-WM, a 2.6B-parameter world model that generates minute-long 720p video from one input image and a specified camera trajectory. The group detailed the system on the project page, alongside an arXiv paper and open-source code on GitHub; model weights are marked as "soon" on the page.

The NVLabs authors, who also include Jincheng Yu, Tong He, Song Han, and Enze Xie, pitch SANA-WM as an efficiency-first approach to long-horizon video generation. According to the paper, the model was trained for 15 days on 64 H100s and can render a full 60-second clip on a single H100 at inference. A distilled variant reportedly runs on a single RTX 5090 with NVFP4 quantization, denoising a 60-second 720p clip in 34 seconds.
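To put those figures in perspective, here is a back-of-the-envelope calculation in Python. It uses only the numbers quoted above; the variable names are ours, and the speed figure covers the reported denoising time only, not any other stage of the pipeline.

```python
# Figures as reported from the paper in this article.
TRAIN_DAYS = 15
TRAIN_GPUS = 64          # H100s used for training
CLIP_SECONDS = 60        # length of the generated clip
DENOISE_SECONDS = 34     # distilled variant on one RTX 5090

# Total training budget expressed in GPU-days and GPU-hours.
gpu_days = TRAIN_DAYS * TRAIN_GPUS
gpu_hours = gpu_days * 24

# Seconds of video produced per second of denoising.
realtime_factor = CLIP_SECONDS / DENOISE_SECONDS

print(f"training budget: {gpu_days} GPU-days ({gpu_hours} GPU-hours)")
print(f"distilled denoising speed: {realtime_factor:.2f}x real time")
```

By this arithmetic, training cost 960 GPU-days, and the distilled variant denoises at roughly 1.76 times real time.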

What they built

SANA-WM is presented as a "world model" tuned natively for minute-scale sequences with precise 6-DoF camera control. Rather than free-roaming text-to-video, the system takes a starting image and an explicit camera path, then produces long, coherent rollouts at 720p. On their one-minute benchmark, the team reports action-following accuracy that exceeds prior open-source baselines and visual quality comparable to industrial systems such as LingBot-World and HY-WorldPlay, with what they claim is 36x higher throughput.

Key design elements, per the paper:

  • Hybrid Linear Attention: a frame-wise Gated DeltaNet paired with periodic softmax attention to keep memory growth compact over long contexts.
  • Dual-Branch Camera Control: a coarse global pose branch plus a pixel-aligned geometric branch to more faithfully track metric camera trajectories in 6-DoF.
  • Two-Stage Generation: a dedicated 17B long-video refiner sharpens texture, motion, and late-window consistency on top of the long-rollout backbone.
  • Robust Annotation Pipeline: metric-scale 6-DoF camera poses are extracted from public videos to supervise action labels; the team cites roughly 213K clips in training.
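
For readers unfamiliar with the first item, the sketch below shows the generic gated delta rule that the Gated DeltaNet family is built on, written in plain Python for a single attention head. It is an illustration of the published update rule, not SANA-WM's frame-wise implementation; the function names `gated_delta_step` and `read` are hypothetical, and the scalar gates `alpha` and `beta` would be learned per step in a real model.

```python
def gated_delta_step(state, k, v, alpha, beta):
    """One gated delta-rule update of a (d_v x d_k) memory matrix.

    state_new = alpha * state @ (I - beta * k k^T) + beta * v k^T
    alpha in [0, 1] decays old memory; beta in [0, 1] sets the write strength.
    """
    # What the memory currently returns for key k.
    predicted = [sum(row[j] * k[j] for j in range(len(k))) for row in state]
    return [
        [alpha * (state[i][j] - beta * predicted[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def read(state, q):
    """Query the memory: returns state @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in state]


# The memory stays 2 x 2 no matter how many steps are written into it.
memory = [[0.0, 0.0], [0.0, 0.0]]
memory = gated_delta_step(memory, k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
first_read = read(memory, [1.0, 0.0])

# Writing a new value under the same key replaces the old one.
memory = gated_delta_step(memory, k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
second_read = read(memory, [1.0, 0.0])
print(first_read, second_read)
```

The point of the design is that the memory is a fixed-size matrix that gets edited in place, rather than a cache that grows with every token, which is why this kind of layer is attractive for minute-long sequences.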

How it works

The hybrid attention approach is designed to avoid the out-of-memory failures that plague all-softmax transformers at minute scale. Figures on the project page show latency and peak-memory scaling experiments, which suggest that the recurrent variants hold memory use in check as duration grows to 60 seconds while preserving scene coherence over long horizons.
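
The memory argument can be made concrete with a rough count of what each layer type has to keep around. The Python sketch below compares a softmax key/value cache with a linear-attention recurrent state; the frame rate, tokens per frame, head count, and head dimension are hypothetical placeholders chosen for illustration, not SANA-WM's configuration.

```python
# Hypothetical sizes for illustration only; not taken from the paper.
FPS = 16
TOKENS_PER_FRAME = 256
NUM_HEADS = 16
HEAD_DIM = 64


def softmax_kv_cache_elems(num_tokens, num_heads, head_dim):
    """Per-layer cache for softmax attention: one key and one value per past token."""
    return 2 * num_tokens * num_heads * head_dim


def recurrent_state_elems(num_heads, head_dim):
    """Per-layer state for linear attention: one head_dim x head_dim matrix per head."""
    return num_heads * head_dim * head_dim


for seconds in (5, 15, 60):
    tokens = seconds * FPS * TOKENS_PER_FRAME
    kv = softmax_kv_cache_elems(tokens, NUM_HEADS, HEAD_DIM)
    state = recurrent_state_elems(NUM_HEADS, HEAD_DIM)
    print(f"{seconds:>2}s: softmax cache {kv:,} elems, recurrent state {state:,} elems")
```

The softmax cache grows in proportion to clip length while the recurrent state does not. Because the design described above keeps periodic softmax layers, a real hybrid model would sit somewhere between the two curves.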

Camera control is a first-class input. Users can define precise 6-DoF camera trajectories; the model conditions on both a global pose and fine, pixel-level geometric signals to adhere to the path. The result aims to enable repeatable, controllable shots rather than drifting, emergent motion.
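
As an illustration of what "an explicit 6-DoF camera path" means in practice, the Python sketch below builds a per-frame list of poses, each with three translation and three rotation components. The `CameraPose` class and `interpolate_path` function are hypothetical and do not reflect SANA-WM's actual input format, which the article's sources do not specify; linear interpolation of Euler angles is also a simplification that only behaves well for small, single-axis rotations.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class CameraPose:
    """One 6-DoF pose: position in metres, orientation in degrees."""
    x: float
    y: float
    z: float
    yaw: float
    pitch: float
    roll: float


def interpolate_path(start, end, num_frames):
    """Return num_frames poses moving linearly from start to end, inclusive."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    path = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        values = (a + (b - a) * t for a, b in zip(astuple(start), astuple(end)))
        path.append(CameraPose(*values))
    return path


# A 2-metre dolly to the right combined with a 90-degree pan, over 5 frames.
path = interpolate_path(
    CameraPose(0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    CameraPose(2.0, 0.0, 0.0, 90.0, 0.0, 0.0),
    num_frames=5,
)
for pose in path:
    print(pose)
```

Because the path is fully specified up front, the same list of poses can be replayed against different starting images, which is what makes a shot repeatable.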

The second-stage refiner is where fidelity is recovered. After the minute-long rollout is generated by the 2.6B backbone, a larger 17B refiner network improves high-frequency detail, motion sharpness, and late-sequence quality, which commonly degrades in long videos.

Availability and licensing

The team has released the source code and detailed documentation on the project page. The page lists "Models soon" for the weights. The paper's benchmarks and qualitative examples are live; productionization details and license terms for the model checkpoints were not specified on the page at publication time.

Where it fits

Minute-scale, controllable video has typically required large fleets of GPUs or compromises in resolution and coherence. By targeting one-minute 720p generation on a single H100, and offering a path to sub-minute inference on a single consumer GPU via distillation and quantization, NVLabs is positioning SANA-WM as an accessible research baseline for long-horizon video control. If the forthcoming weights match the paper's results, teams building synthetic data pipelines, previs, robotics simulators, or cinematic tools may find a practical starting point without resorting to massive inference clusters.
