Google's Gemma 4 12B brings encoder-free multimodal AI to laptops

Google says the Apache 2.0 model handles vision and native audio without separate encoders, runs with 16GB of VRAM or unified memory, and nears its 26B MoE model on benchmarks.

By Ryan Merket · Published Jun 3, 2026, 11:55am CT

Why it matters

Local multimodal models are becoming a distribution fight. If Google's claimed 16GB memory target holds up in developer use, Gemma 4 12B gives builders another Apache-licensed option for on-device assistants, vision workflows, and private inference without renting large cloud GPUs.


Google DeepMind product managers Olivier Lacombe and Gus Martins introduced Gemma 4 12B, a mid-sized open model designed to run multimodal and agentic workloads locally on laptops.

https://www.youtube.com/watch?v=Q5a7dAREbXM

Google says the Apache 2.0 model sits between its edge-friendly E4B and its more advanced 26B Mixture of Experts model, offering benchmark performance near the larger system while using less than half its total memory footprint. The company says Gemma 4 12B can run on consumer laptops with 16GB of VRAM or unified memory.
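For a rough sense of what a 16GB budget implies, the back-of-envelope sketch below estimates the memory needed for the weights alone at a few numeric precisions. It is an illustration, not Google's accounting: it assumes exactly 12 billion parameters, ignores activations, the KV cache and runtime overhead, and the precisions listed are common choices rather than anything stated in the announcement as summarized here. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for model weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a round 12 billion parameters; the real count may differ.
NUM_PARAMS = 12e9

# Assumption: these precisions are illustrative, not confirmed for this release.
for bits in (16, 8, 4):
    gb = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Under these assumptions, only the lower-precision rows would leave headroom inside 16GB once runtime overhead is added, which is why the claimed target is worth checking in real developer use.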

The key architectural change is an encoder-free approach to multimodal input. Rather than routing images and audio through separate encoders before passing representations into the language model, Google says vision and audio inputs flow directly into the LLM backbone. For vision, Google replaced the Gemma 4 vision encoder with a lightweight embedding module using a single matrix multiplication, positional embedding and normalizations. For audio, it removed the encoder entirely and projects raw audio into the same dimensional space as text tokens.
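To make the vision description concrete, here is a minimal pure-Python sketch of an embedding module built from the three ingredients Google names: one matrix multiplication, a positional embedding and normalization. Everything else is an assumption for illustration, including the patch size, the use of RMS normalization, the order of the steps, the toy dimensions and the function names (`patchify`, `embed_patches`). It is not Google's implementation.

```python
import math
import random


def rms_norm(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is roughly 1 (assumed norm type)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened square tiles."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([image[top + r][left + c]
                          for r in range(patch) for c in range(patch)])
    return tiles


def embed_patches(image, patch, weight, pos_emb):
    """Turn each patch into one d_model-sized token for the language model.

    weight:  d_model rows, each of length patch * patch (the single matmul)
    pos_emb: one d_model-sized vector per patch position
    """
    tokens = []
    for idx, tile in enumerate(patchify(image, patch)):
        tile = rms_norm(tile)                                   # normalize pixels
        projected = [sum(t * w for t, w in zip(tile, row))      # one matmul
                     for row in weight]
        with_pos = [p + q for p, q in zip(projected, pos_emb[idx])]
        tokens.append(rms_norm(with_pos))                       # normalize output
    return tokens


# Toy sizes, chosen only to keep the example small.
rng = random.Random(0)
HEIGHT = WIDTH = 8
PATCH = 4
D_MODEL = 6

image = [[rng.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]
n_patches = (HEIGHT // PATCH) * (WIDTH // PATCH)
weight = [[rng.gauss(0, 0.1) for _ in range(PATCH * PATCH)] for _ in range(D_MODEL)]
pos_emb = [[rng.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(n_patches)]

tokens = embed_patches(image, PATCH, weight, pos_emb)
print(len(tokens), len(tokens[0]))  # 4 6
```

The point of the sketch is the contrast with a conventional vision encoder: there is no stack of transformer layers ahead of the language model here, only a projection that puts each image patch into the same vector space as text tokens.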

https://x.com/googlegemma/status/2062202706882883696

That makes Gemma 4 12B Google's first mid-sized Gemma model with native audio input. Google is pitching the release to developers building offline voice, vision and agentic applications, including demos that transcribe, format and translate voice inputs locally through its Google AI Edge Eloquent app.

Google says the broader Gemma 4 model family has crossed 150 million downloads. The new 12B release also includes Multi-Token Prediction drafters intended to reduce latency, and Google is making weights available through Hugging Face and Kaggle. It says developers can experiment through LM Studio, Ollama, Google AI Edge apps and LiteRT-LM, and build with tools including Hugging Face Transformers, llama.cpp, MLX, SGLang, vLLM and Unsloth.

The release is another push by Google to make capable multimodal AI usable outside the cloud, with local inference as the selling point: advanced reasoning, image and audio handling, and agent-style workflows on everyday hardware rather than dedicated server infrastructure.
