The Week Open Weights Went Multimodal

An open-weight release wave hit every layer of the AI stack at once: language, agents, image generation, speech, music, OCR, video, 3D and physical AI.

By Ryan Merket · Published Jun 6, 2026, 3:00am CT

Why it matters

Open-weight AI is moving from text models into deployable multimodal systems. If the reported releases hold up, founders get more options outside closed APIs, but the real test is licenses, inference cost and benchmark reproducibility.

There are busy AI weeks, and then there was this one.

In a single stretch, open-weight AI stopped feeling like a category mostly defined by chat models. The release board filled up across almost every modality that matters: frontier-scale LLMs, laptop-scale multimodal models, image generators with actual design taste, multilingual TTS systems, real-time music models, streaming ASR, document parsers, joint audio-video generators, 3D reconstruction models and world models for physical AI.

Depending on how you count base models, instruction variants, quantized checkpoints, runtime ports and model-family releases, the week easily crossed 25 notable open-weight or openly distributed drops. I would be careful calling it an official record without a longitudinal release database. But as a lived industry moment, it felt like one: the kind of week where every refresh produced another model card, another paper, another X thread, another Hugging Face repo, another runtime integration.

The deeper story is not just volume. It is convergence. Open models are moving in three directions at once: up toward frontier-scale reasoning, down onto phones and laptops, and sideways into every media type that used to require separate proprietary systems. RuntimeWire had already framed NVIDIA's Nemotron 3 Ultra as part of the agent stack fight and Google's Gemma 4 12B as a local multimodal distribution story. Those two poles, giant agent models and laptop-native multimodal models, defined the whole week.

The scoreboard

Here is the week in one view.

Area	Release	Why it mattered
Frontier LLMs	NVIDIA Nemotron 3 Ultra	550B total parameters, 55B active, hybrid Mamba-Attention MoE, 1M context, open weights, data and recipes.
Local multimodal	Google Gemma 4 12B	Encoder-free multimodal Gemma model built to run on 16GB VRAM or unified memory, with Apache 2.0 weights and broad runtime support.
Agentic VLMs	StepFun Step-3.7-Flash	198B sparse MoE VLM, about 11B active, 256K context, SWE-Bench PRO 56.3, Apache 2.0.
Edge LLMs	Liquid AI LFM2.5-8B-A1B	8B on-device MoE with 128K context, 38T-token training and day-one llama.cpp, MLX, vLLM and SGLang support.
Code models	JetBrains Mellum2-12B-A2.5B-Thinking	JetBrains' first open MoE, 12B total and 2.5B active per token, built for code and natural-language workflows under Apache 2.0.
Image generation	Ideogram 4	Ideogram's first open-weight foundation model, a 9.3B DiT with strong typography, structured JSON prompting and top open-weight positioning on the company's reported design evals.
TTS	Boson Higgs Audio v3	4B-class expressive TTS across 100-plus languages with emotion, style, prosody and sound-effect control.
TTS	RedNote dots.tts	2B continuous end-to-end TTS pipeline without discrete codec tokens, released under Apache 2.0.
Music	Google Magenta RealTime 2	Open real-time music model and inference engine for MIDI, audio and text control, with about 200ms control latency.
ASR	NVIDIA Nemotron 3.5 ASR	600M streaming ASR model for more than 40 language locales, with NVIDIA claiming roughly 17x more concurrent streams than Parakeet RNNT 1.1B at an 80ms chunk size.
Document AI	PaddleOCR-VL-1.6	1B-parameter document parser with 96.33% on OmniDocBench v1.6 and Apache 2.0 licensing.
Audio-video	Baidu NAVA	6.3B joint audio-video generator with 720p generation, speaker/timbre controls and Apache 2.0 release.
World models	NVIDIA Cosmos 3	Open physical-AI foundation models spanning text, image, video, audio and action trajectories.
Long video	JD JoyAI-Echo	Multi-shot audio-video generation aimed at stories up to five minutes, with cross-modal memory and open weights for research/non-commercial use.
Video editing	ByteDance Bernini-R	Open-sourced inference code and weights for the Bernini renderer, part of ByteDance's latent semantic planning framework for video generation and editing.
3D	VAST/TripoSplat	Single-image-to-3D Gaussian splats, open-source under MIT, with day-zero ComfyUI support.

The LLM week: huge, sparse, local and specialized

The headline model was NVIDIA Nemotron 3 Ultra. On paper, it is the kind of release that would normally dominate an entire week by itself: 550B total parameters, 55B active per token, hybrid Mamba-Attention MoE architecture, 1M context, 20T text-token pretraining and both BF16 and NVFP4 paths. NVIDIA's technical report says the family publishes not just checkpoints, but also training data and recipes on Hugging Face, which matters because "open weights" alone is no longer the highest bar for infrastructure-minded developers.

RuntimeWire covered Nemotron twice. The first piece, published as NVIDIA teased the release, framed Ultra as part of NVIDIA's climb from GPU supplier into full-stack AI platform. The follow-up noted Artificial Analysis' claim that Nemotron 3 Ultra had become the strongest U.S. open-weight model on its Intelligence Index, while also making clear that this was a benchmark-group claim rather than a universal independent standard. That caution is the right posture. Nemotron's importance is not one leaderboard number. It is the combination of scale, openness, deployment tooling and NVIDIA's control over the hardware layer.

Gemma 4 12B was the other anchor release, and it mattered for the opposite reason. It was not the biggest model of the week. It was the most deployable general-purpose one. Google described Gemma 4 12B as a unified, encoder-free multimodal model that can run locally on consumer laptops with 16GB of VRAM or unified memory. It handles text, vision and audio inputs, replaces heavier multimodal encoders with lighter projection-style components and ships under Apache 2.0. RuntimeWire's coverage nailed the strategic angle: local multimodal models are becoming a distribution fight.

Then Google followed with a quantization wave. The Gemma 4 QAT release included Q4_0 and mobile-format checkpoints, with Google saying the E2B mobile format can fit below 1GB and the broader QAT ecosystem works through Hugging Face formats, GGUF, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, vLLM, MLX and Unsloth. RuntimeWire also covered the more aggressive claim circulating around Ollama's QAT weights for Gemma 4 31B, but correctly treated the "near Claude Opus 4 on laptop hardware" framing as a claim that needs reproducible benchmark tables, hardware specs and throughput details.
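A back-of-envelope calculation shows why the quantized checkpoints are the distribution story. The sketch below estimates the size of the weights alone from parameter count and bits per weight. The 4.5 bits-per-weight figure for Q4_0 is an assumption based on the GGUF block layout (32 four-bit weights plus one 16-bit scale), and the function name is illustrative. It ignores KV cache, activations, runtime overhead and any tensors a given checkpoint keeps at higher precision, so treat the results as a floor rather than a measured footprint.

```python
# Nominal bits per weight. The 4.5 figure for Q4_0 is an assumption:
# blocks of 32 four-bit weights plus one 16-bit scale per block.
BITS_PER_WEIGHT = {"bf16": 16.0, "q4_0": 4.5}


def weight_memory_gb(params_billion: float, fmt: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billion * BITS_PER_WEIGHT[fmt] / 8.0


if __name__ == "__main__":
    # Parameter counts are the nominal sizes in the model names.
    for name, size_b in (("Gemma 4 12B", 12.0), ("Gemma 4 31B", 31.0)):
        for fmt in BITS_PER_WEIGHT:
            gb = weight_memory_gb(size_b, fmt)
            print(f"{name:<12} {fmt:<5} ~{gb:.2f} GB of weights")
```

On these assumptions, a 12B model drops from about 24 GB of weights in BF16 to under 7 GB at Q4_0. That gap is the difference between needing a workstation GPU and fitting on a 16GB laptop with room left for context.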

StepFun's Step-3.7-Flash added a different flavor: a very large sparse VLM aimed at agentic workflows. The official repo lists it as a 198B sparse MoE VLM with a 196B language backbone, a 1.8B vision encoder, about 11B active parameters, 256K context, and benchmark claims including SWE-Bench PRO 56.3, Terminal-Bench 59.5 and Toolathlon/HLE tool-use results. It is Apache 2.0 and ships with multiple weight formats, including BF16, FP8, NVFP4 and GGUF.

Liquid AI's LFM2.5-8B-A1B was the edge-model standout. Liquid positioned it as an on-device MoE for fast, reliable tool calling on consumer hardware, with a 128K context window, a 128K-token vocabulary and training scaled to 38T tokens. RuntimeWire's write-up framed it as "edge AI shifting from demos to deployables," which is exactly right. The interesting part is not merely that an 8B-class model exists. It is that it arrives already wired into the runtimes developers actually use: llama.cpp, MLX, vLLM, SGLang, ONNX and Liquid's own LEAP path.

JetBrains Mellum2-12B-A2.5B-Thinking rounded out the LLM slate with a more vertical release. JetBrains says Mellum2 is a 12B MoE trained from scratch on natural language and code, with 2.5B active parameters per token and Apache 2.0 licensing. The positioning is practical: code assistance, RAG pipelines, routing, orchestration, sub-agents and private local deployment. It is not trying to be a giant generalist. It is trying to become a focal model in developer workflows where latency, throughput and cost matter.

The common theme: the open-weight LLM race is no longer "bigger or smaller." It is "right-sized and runtime-ready." Nemotron pushes the ceiling. Gemma pushes the laptop. Liquid pushes the edge. StepFun pushes VLM agents. JetBrains pushes code workflows. Those are different fronts in the same war.

The image surprise: Ideogram opened the taste layer

Ideogram 4 was the week's emotional shock.

Ideogram did not just release another text-to-image model. It released its first open-weight foundation model: a 9.3B single-stream Diffusion Transformer trained from scratch, using a VLM text encoder, structured JSON prompts, multilingual text rendering, bounding-box layout control, color-palette control and native 2K output. The official blog says Ideogram 4 ranked second overall behind GPT Image 2 in its internal designer-preference evaluation and ranked as the top open-weight model across several reported design and typography benchmarks.

This matters because image generation is not judged only by technical fidelity. It is judged by taste: typography, layout, negative space, hierarchy, brand feel, poster balance, packaging logic, visual wit. The open ecosystem has had strong image models before, but text-rich design has remained one of the places where closed systems felt meaningfully ahead. Ideogram opening weights changes that conversation.

There is an important licensing caveat. The GitHub repo is Apache 2.0, but the model weights are under the Ideogram 4 Non-Commercial license. So this is open-weight, not a permissive commercial release. Still, for researchers, indie artists, workflow builders and evaluation nerds, the release is a major event. It makes the design frontier inspectable.

NVIDIA PiD also deserves a spot in the image bucket. PiD, short for Pixel Diffusion Decoder, reformulates latent-to-pixel decoding as a conditional pixel-space diffusion model, unifying decoding and upsampling into one generative module. The model card says the released checkpoints include 2K and 2K-to-4K variants for several latent backbones, including Flux, SD3, SDXL and Qwen-Image. The license is non-commercial research/evaluation, but the technical idea is important: better high-resolution decoding may become a modular layer under many image models rather than a feature tied to one model family.

Audio broke out

Audio had the most surprising depth of the week. Four different release lanes moved at once: expressive TTS, continuous TTS, real-time music and streaming ASR.

Boson Higgs Audio v3 is the splashiest TTS release. Boson's blog describes it as an expressive speech model for voice chat across 100-plus languages, with zero-shot cloning and inline control over emotion, style, prosody, pauses and sound effects. LMSYS' SGLang-Omni write-up says Higgs Audio v3 uses a roughly 4B autoregressive decoder based on Qwen3-4B, supports interleaved text and audio tokens, and exposes controls for more than 20 emotions, singing, whispering and shouting. The self-hosted weights are under a research and non-commercial license, with commercial use requiring separate terms.

RedNote's dots.tts was the architecturally interesting counterpoint. It is a 2B fully continuous end-to-end autoregressive TTS pipeline that does not rely on discrete codec tokens. Instead, it combines a semantic encoder, an LLM initialized from Qwen2.5-1.5B and an autoregressive flow-matching acoustic head over a 48kHz AudioVAE. The project page and Hugging Face card highlight base, self-corrective aligned and MeanFlow-distilled variants, all under Apache 2.0.

Google Magenta RealTime 2 pushed into live music generation. Magenta describes it as an open model plus efficient real-time inference engine for playing AI instruments on a laptop, controlled by text, audio and MIDI. The Hugging Face card lists 2.4B and 230M variants, open weights under CC BY 4.0 and code under Apache 2.0. Magenta says the model supports continuous low-latency musical audio generation with about 200ms control latency.

NVIDIA Nemotron 3.5 ASR completed the audio stack from the recognition side. NVIDIA's model card describes it as a 600M streaming ASR checkpoint for 40 language locales, with punctuation and capitalization in the same checkpoint. NVIDIA reports cache-aware streaming behavior and claims roughly 17x more concurrent streams than Parakeet RNNT 1.1B at an 80ms chunk size.
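To make the 80ms chunk size concrete, here is a minimal sketch of what chunked streaming input looks like on the client side. The 16 kHz sample rate and the helper names are assumptions for illustration, not details from NVIDIA's model card. The cache-aware part of the design lives inside the model and is not shown here.

```python
from typing import Iterator, Sequence


def samples_per_chunk(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of audio samples in one chunk of the given duration."""
    if sample_rate_hz <= 0 or chunk_ms <= 0:
        raise ValueError("sample rate and chunk duration must be positive")
    return sample_rate_hz * chunk_ms // 1000


def chunk_audio(samples: Sequence[float], sample_rate_hz: int,
                chunk_ms: int) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    step = samples_per_chunk(sample_rate_hz, chunk_ms)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


if __name__ == "__main__":
    rate = 16_000            # assumed sample rate, not from the model card
    one_second = [0.0] * rate
    chunks = list(chunk_audio(one_second, rate, chunk_ms=80))
    print(samples_per_chunk(rate, 80), "samples per 80ms chunk")
    print(len(chunks), "chunks for one second of audio")
```

At 80ms, each stream hands the server 12.5 chunks per second. That is why concurrency at a fixed chunk size is the number that matters for serving cost.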

This is why the audio week felt like a breakout. It was not one better TTS demo. It was the outline of a full open audio stack: speech in, speech out, expressive control, multilingual coverage, music generation and serving infrastructure.

Vision and documents: the boring workflows got a lot less boring

PaddleOCR-VL-1.6 is the kind of release that may be less viral than an image model but more useful in production. The Hugging Face card describes a compact 1B-parameter document parser with region-aware optimization and progressive post-training, reporting 96.33% on OmniDocBench v1.6 and improvements on tables, formulas, ancient documents, rare characters, seals, charts, text spotting and document layout tasks. It ships under Apache 2.0.

That matters because document parsing is where a lot of enterprise AI projects either become real or die. PDFs, invoices, forms, scanned tables, stamps, charts and mixed-language documents are not glamorous. They are also everywhere. A strong, compact, permissively licensed OCR/document model is infrastructure.

NVIDIA LocateAnything-3B also belongs here. It is a 3B vision-language grounding model for precise object localization, dense detection, GUI element grounding, text localization and pointing. NVIDIA says its Parallel Box Decoding predicts bounding boxes in parallel rather than token-by-token, improving throughput up to 2.5x versus prior approaches. The model is research and development only, but as a grounding layer for multimodal agents and physical AI, it is worth watching.

Audio-video generation moved from clips toward systems

Baidu's NAVA was one of the strongest underappreciated releases of the week. The model card describes it as a 6.3B joint audio-video generator that creates synchronized video and audio from a single prompt, with reference timbre control and image continuation support. It uses an "Align-then-Fuse" MMDiT design, supports 720p generation and dual-channel audio, and is released under Apache 2.0.

The key phrase is joint audio-video. Many video systems still treat audio as a later-stage attachment. NAVA is part of the push toward native A/V generation, where speech, timbre, lip motion, background audio and visual timing are generated as a coupled system.

JD's JoyAI-Echo pushed in a different direction: long, multi-shot audio-video stories. The Hugging Face card describes an inference-only release for minute-level multi-shot A/V generation, with a distilled DMD generator and paired cross-modal memory. JD says the memory bank helps preserve character appearance and voice timbre across up to five-minute videos, and the project is released for research and non-commercial use.

Meituan's LongCat-Video-Avatar 1.5 is another release that deserves attention. It is MIT-licensed and focused on audio-driven human video generation, supporting Audio-Text-to-Video, Audio-Text-Image-to-Video and video continuation. The model card says v1.5 swaps in Whisper-Large for smoother lip dynamics, targets production-ready stability, supports realistic and animated domains and offers 8-step inference through DMD2-style distillation.

ByteDance's Bernini-R added the video editing/generation angle. Hugging Face lists the June 1 release of inference code and model weights for the Bernini renderer, while the GitHub repo says the diffusers-format bundle includes Wan2.2 base components and Bernini-R transformer weights. It is Apache 2.0, which makes it one of the week's more permissively useful video releases.

World models and 3D: the physical AI layer keeps getting louder

NVIDIA Cosmos 3 was the broadest "world model" release of the week. NVIDIA described Cosmos 3 as an open physical-AI foundation model with mixture-of-transformers architecture and native reasoning/generation across text, image, video, ambient sound and actions. The Hugging Face collection lists Cosmos3 Nano and Super, including 16B and 64B model classes, with models that can generate video, image, audio and action commands from combinations of text, image, video and action-trajectory inputs.

This is NVIDIA's physical AI thesis in model form. The company is not just releasing chat models. It is building a stack for robots, autonomous systems, video agents, simulation and action-conditioned generation. Nemotron targets digital agents. Cosmos targets embodied ones.

VAST/TripoSplat handled the 3D side. ComfyUI's write-up says TripoSplat turns a single image into 3D Gaussian splats, ships with MIT-licensed weights and code, and received day-zero ComfyUI support. RadianceFields likewise described it as an open-source image-to-Gaussian-splat model useful for 3D previews, AR/VR workflows and 3D-to-2D guidance.

A single-image-to-3D release would have been a headline in many quieter weeks. In this one, it was part of the overflow.

The safety and quantization releases matter too

Two NVIDIA releases were easy to miss because the week was so loud.

First, Nemotron 3.5 Content Safety. NVIDIA describes it as a 4B multimodal guardrail model for text and images, covering 23 categories across 12 languages, with custom-policy and reasoning support. The model is based on Gemma-3-4B-it, has a 128K context window and is positioned for content moderation and safety workflows.

Second, NVIDIA's NVFP4-quantized Qwen3.6-35B-A3B checkpoint. RuntimeWire covered the release as a production-inference story: a pre-quantized, Apache 2.0 Qwen3.6-35B-A3B build targeting Hopper and Blackwell GPUs, with a context window up to 262K tokens and NVIDIA Model Optimizer used for NVFP4 quantization.

These are not as flashy as new foundation models. They may be just as important for adoption. Guardrails, quantization, runtime support and reproducible deployment paths are how open models become usable systems.

What this week says about the state of open AI

The week's lesson is not "open models are back." They never left. The better read is that open models are now filling the whole stack.

First, active parameters matter more than total parameters. Nemotron, StepFun, Liquid and Mellum all tell variations of the same story: bigger total models, fewer active parameters, more careful routing and more emphasis on throughput. Sparse models are becoming less exotic and more operational.
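The arithmetic behind that claim is simple enough to check. The sketch below computes the share of parameters active per token from the figures quoted in this roundup. The 1B active count for LFM2.5 is an assumption read off the "A1B" suffix in the model name, and the function name is illustrative.

```python
# (total, active) parameter counts in billions, as quoted in this article.
# The LFM2.5 active count is an assumption inferred from the "A1B" suffix.
MODELS = {
    "Nemotron 3 Ultra": (550.0, 55.0),
    "Step-3.7-Flash": (198.0, 11.0),
    "LFM2.5-8B-A1B": (8.0, 1.0),
    "Mellum2-12B-A2.5B": (12.0, 2.5),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token in a sparse MoE model."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active_b <= total_b")
    return active_b / total_b


if __name__ == "__main__":
    # Sort from sparsest to densest activation.
    ranked = sorted(MODELS.items(), key=lambda kv: active_fraction(*kv[1]))
    for name, (total_b, active_b) in ranked:
        share = active_fraction(total_b, active_b)
        print(f"{name:<20} {active_b:>5.1f}B of {total_b:>5.1f}B = {share:.1%}")
```

On these figures, the four models activate between roughly 6% and 21% of their weights per token. Per-token compute therefore tracks the active count far more closely than the headline size, while memory still has to hold the full model.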

Second, local deployment is now a product requirement. Gemma 4 12B, Gemma QAT, Liquid LFM2.5, dots.tts and Magenta RealTime 2 all compete on the idea that good AI should run closer to the user. Sometimes that means a laptop. Sometimes a phone. Sometimes a DAW. Sometimes a car. The cloud remains the frontier-quality default, but local models are becoming a real architecture choice.

Third, "open" now needs footnotes. Apache 2.0, MIT, CC BY 4.0, non-commercial research licenses and custom model licenses all appeared in the same week. Ideogram 4 is open-weight but non-commercial. Higgs Audio v3 is self-hostable but commercially restricted. NAVA, StepFun, dots.tts, Mellum2, Gemma and Bernini-R are much more permissive. Any serious roundup needs to distinguish "inspectable," "research-usable," "commercially usable" and "full-stack reproducible."
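One way to keep those footnotes straight is to track license terms as data rather than prose. The sketch below groups releases by the license labels reported in this roundup. The two-tier mapping is an illustrative simplification, not a legal reading: it does not capture the "full-stack reproducible" tier or custom license terms. Releases whose license is not stated here are left out.

```python
from collections import defaultdict

# Licenses the article describes as permissive. This set is a simplification.
PERMISSIVE = {"Apache 2.0", "MIT", "CC BY 4.0"}

# License labels as reported in this roundup; weights only where noted.
RELEASES = {
    "Gemma 4 12B": "Apache 2.0",
    "Step-3.7-Flash": "Apache 2.0",
    "Mellum2-12B-A2.5B-Thinking": "Apache 2.0",
    "dots.tts": "Apache 2.0",
    "PaddleOCR-VL-1.6": "Apache 2.0",
    "NAVA": "Apache 2.0",
    "Bernini-R": "Apache 2.0",
    "TripoSplat": "MIT",
    "LongCat-Video-Avatar 1.5": "MIT",
    "Magenta RealTime 2 (weights)": "CC BY 4.0",
    "Ideogram 4 (weights)": "non-commercial",
    "Higgs Audio v3": "non-commercial",
    "JoyAI-Echo": "non-commercial",
    "PiD": "non-commercial",
    "LocateAnything-3B": "research only",
}


def tier(license_name: str) -> str:
    """Map a license label to a coarse usability tier (hypothetical scheme)."""
    if license_name in PERMISSIVE:
        return "commercially usable"
    return "research-usable"


def group_by_tier(releases: dict) -> dict:
    """Return {tier: sorted list of release names}."""
    grouped = defaultdict(list)
    for name, license_name in releases.items():
        grouped[tier(license_name)].append(name)
    return {t: sorted(names) for t, names in grouped.items()}


if __name__ == "__main__":
    for t, names in group_by_tier(RELEASES).items():
        print(f"{t} ({len(names)}): {', '.join(names)}")
```

A table like this is only a starting point. The actual license text on each model card is what decides whether a given deployment is allowed.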

Fourth, multimodality is no longer a feature. It is the default direction. Text-only models are still valuable, especially for code and agents, but the center of gravity is moving toward models that see, hear, speak, sing, parse documents, generate video, ground objects, preserve identity, control tools and simulate action.

Finally, the release cadence itself has changed. A week like this used to feel impossible because training was the bottleneck, distribution was slow and runtimes were fragmented. Now the release pattern looks different: model weights on Hugging Face, day-one MLX or llama.cpp ports, vLLM/SGLang support, quantized checkpoints, OpenRouter listings, NIM endpoints, ComfyUI nodes, ZeroGPU demos and X threads within hours.

That is why the week felt insane. Not because one lab shipped one dominant model, but because the open ecosystem behaved like a swarm.

The extra releases I would add to the original list

A few drops deserve to be folded into the roundup rather than left as footnotes.

NVIDIA PiD belongs in the image section. It is a high-resolution pixel diffusion decoder, not a general image generator, but it could become a useful modular layer for 2K and 4K image workflows.

NVIDIA LocateAnything-3B belongs in vision and physical AI. It is a grounding model, and grounding is one of the missing pieces between VLM chat and agents that actually click, point, inspect, localize and act.

NVIDIA Nemotron 3.5 Content Safety belongs in the stack section. Open models need open guardrails, and this one targets both text and images across 23 categories and 12 languages.

NVIDIA Qwen3.6-35B-A3B-NVFP4 belongs in the deployment section. It is not a new base model, but pre-quantized production-oriented checkpoints are part of what makes the open ecosystem move faster.

Meituan LongCat-Video-Avatar 1.5 belongs in the audio-video section. It is MIT-licensed and aimed squarely at production-ready audio-driven avatar video, with single-stream and multi-stream audio support.

Google Gemma 4 QAT deserves to be counted as its own release wave. The 12B model was the model story, but the quantized checkpoints are the distribution story.

Bottom line

This was the week open-weight AI became impossible to summarize as "LLMs." The releases hit reasoning, coding, agents, images, speech, music, transcription, OCR, documents, video, 3D and world models. Some were permissive. Some were non-commercial. Some were giant. Some were built for laptops. Some were not new base models but crucial deployment artifacts.

Together, they captured the industry's current mood: the frontier is still expensive, but the frontier's shape is leaking outward. Every week, more capability moves from closed demo to downloadable weights, from cloud endpoint to laptop runtime, from research repo to production stack.

This week, it did not trickle. It flooded.