LanceDB says NVIDIA used Lance datasets to curate Cosmos 3 training data

NVIDIA's report describes Cosmos 3 as an omnimodal model family; LanceDB's claim ties the release to training-data infrastructure.

By Ryan Merket · Published Jun 2, 2026, 1:27am CT

Why it matters

If LanceDB's claim is accurate, Cosmos 3 is evidence that physical AI training infrastructure is being shaped not just by models and GPUs, but by the storage layer used to curate vast multimodal datasets.

LanceDB said in a post on X that NVIDIA's Cosmos 3 training pipeline was built on LanceDB, with NVIDIA's internal SILA curation platform processing "tens of billions" of multimodal training candidates as a single unified Lance dataset.

The post points to the infrastructure section of NVIDIA's Cosmos 3 technical report, published June 1. The LanceDB/SILA wording is not visible in the scraped excerpt of the PDF supplied to RuntimeWire, so the data-layer claim should be read as LanceDB's assertion unless checked against the full report text. NVIDIA's report does independently describe Cosmos 3 as a family of "omnimodal world models" for physical AI that jointly process and generate language, image, video, audio, and action sequences.

That makes the post more than a customer-logo flex for LanceDB. If the claim holds, NVIDIA's pipeline was not only a GPU and model-architecture story. It also depended on a data curation layer able to treat an enormous, mixed-modality candidate pool as one dataset before training.

What LanceDB is claiming

LanceDB's specific claim is that SILA, NVIDIA's internal curation platform, handled tens of billions of multimodal candidates as a single Lance dataset. The number is not exact, and neither the X post nor the excerpted report provides a denominator, final selected dataset size, or the amount of data discarded during curation.

Still, the phrasing matters. Training pipelines for physical AI have to combine formats that do not naturally behave like rows in a conventional table: language, images, video, audio, robot actions, driving scenes, warehouse simulations, and other high-dimensional data. LanceDB is positioning Lance, its columnar data format, as the layer that can keep that material in one dataset rather than splintered across modality-specific stores.
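To make the "one dataset" idea concrete, here is a minimal Python sketch of a single candidate table in which every modality is an optional column and one filter can span all of them. It is an illustration only: it does not use Lance or LanceDB APIs, it says nothing about how SILA works, and the `Candidate` fields, the `curate` helper, the file paths, and the `quality` score are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Candidate:
    """One training candidate; each modality is an optional column (hypothetical schema)."""
    candidate_id: int
    caption: Optional[str] = None          # language
    video_uri: Optional[str] = None        # video
    audio_uri: Optional[str] = None        # audio
    actions: Optional[List[float]] = None  # action sequence
    quality: float = 0.0                   # hypothetical curation score


def curate(rows: Sequence[Candidate], min_quality: float,
           required: Sequence[str]) -> List[Candidate]:
    """Keep rows that meet the quality threshold and have every required column populated."""
    return [
        r for r in rows
        if r.quality >= min_quality
        and all(getattr(r, col) is not None for col in required)
    ]


# Illustrative rows; the paths and scores are made up.
candidates = [
    Candidate(1, caption="robot arm picks up a cup", video_uri="clips/0001.mp4",
              actions=[0.1, -0.2, 0.0], quality=0.92),
    Candidate(2, caption="warehouse aisle", video_uri="clips/0002.mp4", quality=0.40),
    Candidate(3, audio_uri="audio/0003.wav", quality=0.88),
]

# One pass over one table: require video and actions, and a minimum score.
selected = curate(candidates, min_quality=0.8, required=("video_uri", "actions"))
print([r.candidate_id for r in selected])
```

With modality-specific stores, the same selection would typically mean joining separate systems on a shared ID before any filter could run. That joining cost is the one a unified dataset is meant to avoid.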

For LanceDB, the timing attaches its storage architecture to one of NVIDIA's most visible physical AI releases. For NVIDIA, the report frames Cosmos 3 as an attempt to collapse multiple model categories into one system: vision-language models, video generators, world simulators, and world-action models.

What NVIDIA confirmed in the report

NVIDIA says Cosmos 3 uses a unified mixture-of-transformers architecture and supports flexible input-output configurations across five modalities: language, image, video, audio, and action. The company also says it is releasing code, model checkpoints, curated synthetic datasets, and an evaluation benchmark under the Linux Foundation's OpenMDW-1.1 License.

The report lists two open-source code repositories, Cosmos and Cosmos-Framework, along with a Cosmos 3 model collection on Hugging Face. It also names five open-weight checkpoints: Cosmos3-Super, Cosmos3-Nano, Cosmos3-Super-Text2Image, Cosmos3-Super-Image2Video, and Cosmos3-Nano-Policy-DROID.

NVIDIA further says its post-trained Cosmos 3 models ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena, at the time the report was written. Those rankings are NVIDIA's claims in the report, not independently established in the supplied material.

The unresolved commercial question

The open question is what exactly "built on LanceDB" means in this deployment. The source material does not say whether NVIDIA used LanceDB's commercial service, open-source Lance or LanceDB components, or an internal integration built around the Lance format.

That distinction matters for how to read the win. A commercial deployment would say something about LanceDB's enterprise pull inside frontier-model infrastructure. An open-source or format-level integration would still be significant, but it would point more to developer adoption and architecture fit than to revenue.

Either way, the sharper takeaway is that data layout is becoming part of the competitive stack for physical AI. Cosmos 3 is presented by NVIDIA as an omnimodal model family. LanceDB is betting that the harder, less visible part of that ambition starts before training, when the system has to decide what data can be kept together, searched, filtered, and turned into a training run at scale.
