TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google
TL;DR
Google sees two distinct on-device AI tracks — Cormac Brick says mobile is splitting into OS-level “system genAI” built on 2B–5B models (Gemma 4-class candidates) and in-app “tiny LLMs” under 1B parameters that are usually fine-tuned for narrow tasks.
Gemma 4’s big unlock is on-device skills, not just better chat — the E2B and E4B models add built-in thinking plus function calling, which let Google’s AI Gallery app load lightweight skills like mood journaling, maps, Wikipedia, and even music generation on Android and iOS.
The trick for small edge agents is token efficiency — instead of stuffing every tool into context MCP-style, Google only shows the model one-line skill descriptions first, then lets it call load_skill to fetch full instructions on demand.
Tiny models get surprisingly good when you fine-tune hard for one job — Brick says Google’s 270M FunctionGemma jumped from roughly the 40% range to about 86% reliability on a 10-function mobile action benchmark after fine-tuning, with some simple functions hitting ~93%.
Google AI Edge tooling is becoming a real deployment stack — LiteRT-LM now runs across Android, iOS, desktop, web-adjacent targets, and IoT, with CPU/GPU portability from a single file and specialized ahead-of-time (AOT) compilation for NPUs from Qualcomm, Intel, and MediaTek.
Google’s own shipping app pattern is modular, not monolithic — in the iOS-only AI Edge Eloquent app, speech recognition and text polishing are separate Gemma-derived tiny models, showing how production edge apps can chain specialized models instead of forcing one model to do everything.
The Breakdown
From Raspberry Pi demos to Google’s edge AI roadmap
Cormac Brick opens with the credibility story: he’s spent about a decade on edge AI, from running GoogLeNet on a USB accelerator attached to a Raspberry Pi at NeurIPS 2016 to leading Intel’s laptop NPU architecture and now tech-leading Google AI Edge. His framing is simple: edge AI matters because of latency, privacy, offline use, and cost, with Pixel’s live voice translation as the kind of thing that just doesn’t work well enough if you bounce everything through the cloud.
The split: OS-level models vs tiny in-app models
The first big idea is that mobile AI is bifurcating. “System genAI” means a larger shared model baked into the OS — think Android AICore or Apple Intelligence — while “in-app genAI” means tiny models shipped with the app itself for broad reach, including non-premium devices. Brick says these tiny models can be shockingly capable for narrow jobs like summarization, transcription, and voice-to-action, but below roughly 500M parameters, fine-tuning is usually the difference between a neat demo and production reliability.
Why Gemma 4 is a serious edge model now
Brick then digs into Gemma 4, especially the E2B and E4B variants optimized to keep only about 2B or 4B effective parameters resident in RAM. The practical win isn’t just quality — it’s that the models have built-in function calling and built-in thinking, plus multimodal support on the smaller variants, which makes them realistic candidates for mobile and embedded agent workflows. He also calls out a deployment-friendly change: this is the first Gemma release under a plain Apache 2.0 license.
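To make the function-calling part concrete, here is a minimal sketch of the shape of an on-device function-calling turn. The tool schema, function name, and JSON call format below are illustrative assumptions, not the actual Gemma 4 chat template.

```python
import json

# Illustrative tool schema and call format; this mimics the shape of an
# on-device function-calling turn, not the real Gemma 4 template.
TOOLS = [{
    "name": "set_alarm",
    "description": "Set an alarm on the device.",
    "parameters": {"time": "HH:MM, 24-hour", "label": "short string"},
}]

def build_prompt(user_text: str) -> str:
    # The app embeds the tool list; the model replies with a JSON call.
    return (
        'Reply with JSON {"name": ..., "args": {...}} to call a tool.\n'
        f"Tools: {json.dumps(TOOLS)}\nUser: {user_text}"
    )

prompt = build_prompt("wake me at 7:30 for the gym")

# What a well-tuned on-device model would emit for that prompt:
model_reply = '{"name": "set_alarm", "args": {"time": "07:30", "label": "gym"}}'

call = json.loads(model_reply)
assert call["name"] in {t["name"] for t in TOOLS}  # validate before dispatching
```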
Performance across phones, laptops, Raspberry Pi, and Qualcomm hardware
On runtime, the pitch is portability: one LiteRT-LM package can run across Android, iOS, macOS, Linux, Windows, web targets, and IoT, usually on CPU or GPU from the same artifact. For NPUs, Google uses a special ahead-of-time compiled artifact, but the developer-facing API stays mostly the same. Brick flashes performance numbers ranging from high-end phones and Macs doing thousands of tokens per second down to a Raspberry Pi at about 133 tokens/sec, which he says is already enough for simple image analysis workloads.
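For a feel of the developer experience Brick describes, here is a sketch of loading one packaged model and picking a backend at call time. The litert_lm module name, Engine class, and every method signature below are invented for illustration; they are not the published LiteRT-LM API.

```python
# Hypothetical Python binding, for illustration only. The module name,
# Engine class, and method signatures are assumptions, not the actual
# LiteRT-LM API.
import litert_lm  # assumed import name

def run(prompt: str, backend: str = "gpu") -> str:
    # One packaged model file; CPU or GPU is selected at load time. NPU
    # targets would instead use a separate ahead-of-time compiled artifact.
    engine = litert_lm.Engine("gemma-e2b.litertlm", backend=backend)
    session = engine.new_session()
    return session.generate(prompt, max_tokens=128)

print(run("Summarize: on-device AI avoids network round trips.", backend="cpu"))
```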
AI Gallery’s skill system: low-code agents on your phone
The liveliest part of the talk is Google AI Gallery, an open-source app for Android and iOS where Gemma 4 can trigger “skills” like mood tracking, maps, Wikipedia lookups, flashcards, or even mood-based music generation. The key detail is that these aren’t custom model fine-tunes — they’re mostly skill descriptions plus a little JavaScript, and the model figures out when to invoke them through free text. Brick’s excitement is obvious here: the team apparently built around 80 skills internally because the barrier was so low, and more than half were initially “vibe-coded.”
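Brick doesn’t walk through a skill’s file layout, but the shape he describes is roughly a one-line description, a fuller instruction file, and a little JavaScript. The sketch below invents a mood-journal skill in that shape; every name and field is hypothetical.

```python
# Hypothetical skill bundle in the shape Brick describes: a one-line
# description the model always sees, a fuller skill.md loaded on demand,
# and a small JavaScript action. All names and fields are illustrative.
mood_journal_skill = {
    "name": "mood_journal",
    "one_liner": "Log how the user is feeling and show a weekly trend.",
    "skill_md": """\
# Mood Journal
When the user describes their mood, call log_mood(mood, note).
Moods: great, good, ok, low, bad. Keep notes under one sentence.
""",
    "action_js": "function log_mood(mood, note) "
                 "{ store.append({mood, note, t: Date.now()}); }",
}
```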
The architecture hack that makes small-model agents workable
Under the hood, Google avoids bloating the prompt by only exposing one-line skill descriptions at first. If the model wants one, it calls load_skill, pulls in the full skill.md, then can run JavaScript or native intents; Brick compares this to progressive disclosure or conditional depth. He says this matters a lot for small edge models, where too much context hurts reliability, and adds that strict constrained decoding for tool calls was a major quality boost, especially around the 2B class.
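Here is a minimal sketch of that progressive-disclosure loop, assuming skills shaped like the bundle above. The registry layout, the load_skill call format, and the single-skill registry are illustrative, not AI Gallery’s actual implementation.

```python
# Minimal progressive-disclosure loop. Registry layout and the
# load_skill call format are illustrative assumptions.
SKILLS = {
    "mood_journal": {
        "one_liner": "Log how the user is feeling and show a weekly trend.",
        "skill_md": "# Mood Journal\nCall log_mood(mood, note) on mood talk.",
    },
}

def system_prompt() -> str:
    # Only one-line descriptions enter the context up front, keeping the
    # prompt small enough for a 2B-class model to stay reliable.
    lines = [f"- {name}: {s['one_liner']}" for name, s in SKILLS.items()]
    return "Skills (reply load_skill(<name>) for details):\n" + "\n".join(lines)

def handle(model_output: str) -> str:
    # If the model asks for a skill, splice the full skill.md into the next
    # turn; constrained decoding would keep this call well-formed.
    if model_output.startswith("load_skill(") and model_output.endswith(")"):
        name = model_output[len("load_skill("):-1].strip("'\"<>")
        return SKILLS[name]["skill_md"]
    return model_output  # plain text reply: no skill needed

print(system_prompt())
print(handle("load_skill(mood_journal)"))
```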
Tiny model workflow: fine-tune, quantize, deploy everywhere
The second half shifts from Gemma 4 agents to sub-1B “tiny models.” LiteRT-LM now has C++ and Java APIs, with Swift incoming and Python newly added, while LiteRT Torch handles export, quantization, and edge-specific optimizations. Brick emphasizes this isn’t Gemma-only: Google also supports third-party models like Qwen and Apple’s 500M FastVLM, which he shows running very fast on Qualcomm hardware, making the case that narrow-but-useful multimodal models are already practical to ship.
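As a minimal sketch of the export step, the snippet below follows the convert-and-export pattern from Google’s open-source ai-edge-torch project, assuming LiteRT Torch exposes that same style of API. The MobileNet stand-in and output file name are placeholders; a fine-tuned tiny LLM would go through the same flow.

```python
import torch
import torchvision
import ai_edge_torch  # Google's open-source PyTorch-to-LiteRT converter

# Placeholder network standing in for a fine-tuned tiny model; the
# convert/export calls follow the ai-edge-torch README pattern.
model = torchvision.models.mobilenet_v3_small(weights=None).eval()
sample_args = (torch.randn(1, 3, 224, 224),)

edge_model = ai_edge_torch.convert(model, sample_args)  # trace and lower
edge_model(*sample_args)                 # sanity-check on the LiteRT runtime
edge_model.export("tiny_model.tflite")   # single artifact for CPU/GPU targets
```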
A real app example: AI Edge Eloquent cleans up your speech
He closes with AI Edge Eloquent, an iOS transcription app built for people who speak like, well, humans — with “ums,” false starts, and domain-specific jargon. The app uses one model for raw ASR and a second Gemma-derived tiny model for “text polishing,” plus a personalization dictionary for names and technical terms like “LoRA.” Brick says this is the pattern they keep seeing: use a strong cloud model to generate lots of synthetic training data, then fine-tune a tiny on-device model for a narrow feature that can ship widely and cheaply.
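That two-model chain is easy to sketch. Every function and the dictionary format below are hypothetical glue standing in for the two on-device models; none of it is Eloquent’s code.

```python
# Hypothetical two-stage pipeline in the Eloquent pattern: a speech model
# produces raw text, then a tiny polishing model cleans it up using a
# per-user dictionary. All names here are illustrative.
PERSONAL_DICT = {"lora": "LoRA", "lite rt": "LiteRT"}

def transcribe(audio: bytes) -> str:
    # Stand-in for the on-device ASR model.
    return "um so we fine-tuned a uh lora adapter for this"

def polish(raw: str, personal_dict: dict[str, str]) -> str:
    # Stand-in for the Gemma-derived polishing model: drop filler words,
    # then restore casing for known jargon from the user's dictionary.
    fillers = {"um", "uh", "so", "like"}
    text = " ".join(w for w in raw.split() if w not in fillers)
    for wrong, right in personal_dict.items():
        text = text.replace(wrong, right)
    return text

print(polish(transcribe(b""), PERSONAL_DICT))
# -> "we fine-tuned a LoRA adapter for this"
```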