Latent Space · May 24, 2026 · 29m

⚡️ Google's Open AI Strategy — Omar Sanseviero, Google DeepMind

TL;DR

Gemma 4’s trick is “effective” parameters — the E2B model has almost 5B total parameters but only 2B active on GPU, with the rest offloaded as per-layer embedding lookups so inference stays fast on constrained devices.
Google is already shipping Gemma-derived models on phones — Gemini Nano, which comes baked into high-end Pixel and Samsung devices, is built on top of Gemma and tuned for offline, privacy-sensitive on-device use cases.
Small open models are catching up faster than expected — Sanseviero says Gemma 4 roughly matches state-of-the-art from 1 to 1.5 years ago, making local agentic behaviors like function calling and system instructions practical even if world knowledge still favors larger Gemini models.
Multimodality is now viable even at 2B and 4B — Gemma 4’s smaller models can handle images, audio, and short 30-60 second videos, with speech recognition, speech translation, object detection, pointing, and captioning, though not yet image segmentation or fused video-plus-audio input.
Fine-tuning is losing ground as base models improve — some of Gemma 4’s 50-60 launch partners planned to fine-tune the 27B model, then skipped it because the model worked well enough out of the box; Sanseviero says demand is shifting toward domain-specific areas like finance and healthcare.
Google’s open strategy is as much ecosystem work as model work — despite a relatively small Gemma team, launches now involve nearly 50 external partners including llama.cpp, Ollama, MLX, Hugging Face, vLLM, Nvidia, and AMD, plus internal ties to Android Studio, Vertex, Kaggle, and DeepMind research.

The Breakdown

Gemma 4 packs nearly 5 billion parameters into a model that loads only 2 billion onto the GPU, a design Omar Sanseviero says is built for phones, Raspberry Pis, and offline AI that’s already shipping inside Pixel and Samsung devices. The bigger message from Google DeepMind: open models are getting good enough out of the box that many teams no longer need to fine-tune, and on-device multimodal agents are quickly becoming practical.
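
To make the "effective parameters" idea concrete, here is a rough back-of-the-envelope sketch of the weight memory involved. The parameter counts are the rounded figures quoted above. The 2 bytes per parameter (16-bit weights) is an assumption for illustration only, not a figure from the episode, and the helper name `weight_memory_gb` is hypothetical.

```python
# Back-of-the-envelope weight-memory estimate for an "effective parameter" model.
# Parameter counts are the rounded figures from the summary; the precision is assumed.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

TOTAL_PARAMS = 5e9        # "nearly 5 billion" total, rounded up for illustration
ACCELERATOR_PARAMS = 2e9  # the "effective" 2 billion loaded onto the GPU
BYTES_PER_PARAM = 2       # assumption: 16-bit weights, no further quantization

on_accelerator = weight_memory_gb(ACCELERATOR_PARAMS, BYTES_PER_PARAM)
total = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
offloaded = total - on_accelerator  # per-layer embedding tables kept off the GPU

print(f"on accelerator:    {on_accelerator:.1f} GB")
print(f"offloaded lookups: {offloaded:.1f} GB")
print(f"total weights:     {total:.1f} GB")
```

Under these assumptions, roughly 4 GB of weights would need to sit on the accelerator, against about 10 GB if every parameter had to be resident. Real deployments typically quantize further, so actual footprints would likely be smaller.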