AI Engineer · 23 min

Accelerating AI on Edge — Chintan Parikh and Weiyi Wang, Google DeepMind

TL;DR

  • Google is pushing Gemma 4 into real on-device agent territory, not just chatbot demos — Chintan Parikh says the focus for edge is the 2B and 4B variants, with built-in function calling, native structured JSON output, and a new “thinking mode” for reasoning-heavy local workflows.

  • The practical pitch for edge AI is speed, privacy, offline reliability, and lower token bills — Parikh frames on-device inference as especially strong for camera filters, video calls, sensitive summarization, and hybrid architectures where developers want to reduce cloud cost without giving up capability.

  • Gemma 4 2B is aimed at phones, while 4B targets beefier devices like laptops and IoT — the 2B model lands around 1–2 GB RAM after quantization, while the 4B model needs more headroom but opens up heavier-duty local applications.

  • Google’s Gallery app is the hands-on showcase for what these models can actually do on device — it demos skills like querying Wikipedia, tracking mood and sleep from journal entries, pairing a breakfast photo with music, and chaining app-like workflows, all with sample code and an open-source repo developers can fork.

  • LiteRT is Google’s cross-platform deployment story for edge AI, and it now reaches beyond TensorFlow: Parikh says PyTorch and JAX models can be converted into the TF Lite format and then run across Android, iOS, macOS, Linux, Windows, web, and IoT hardware.

  • Performance is the big unlock, especially with NPUs: Google claims NPU acceleration can deliver 3–10x gains for real-time workloads, with boosts of up to 13x in some cases, and says its runtime is up to 35x faster than llama.cpp on mobile. The talk also demos Gemma running on a Raspberry Pi-powered robot built that same morning.

The Breakdown

From chatbots to edge agents

Chintan Parikh opens by framing the session around Google DeepMind’s Gemma 4, specifically the 2B and 4B edge-friendly models. His big thesis: the shift is from basic chatbot behavior to more autonomous agents with reasoning, tool use, and the flexibility to run directly on devices people actually ship.

Why run AI on device in the first place?

He races through the edge-AI greatest hits, but with a practical developer lens: latency for camera and video-call scenarios, privacy for sensitive summarization, offline support for bad connectivity, and cost control when cloud token usage starts getting painful. The tone is very much: you don’t have to choose edge or cloud forever — hybrid is the real opportunity.

What’s new in Gemma 4 edge models

Parikh breaks down the hardware envelope first: Gemma 4 2B uses roughly 1–2 GB of RAM after quantization, while 4B is better suited to laptops and IoT-class devices with more memory. The more interesting part is capability: built-in function calling, native structured JSON output without prompt hacks, and a visible chain-of-thought “thinking mode” that the Gallery app can surface.
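The 1–2 GB figure is consistent with simple weight-storage arithmetic. The sketch below is a back-of-the-envelope estimate, not Google's published numbers: it assumes nominal parameter counts of 2 billion and 4 billion, counts weights only, and ignores the KV cache, activations, and runtime overhead. The function name and the bit widths are illustrative choices.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Assumed nominal sizes; the real parameter counts may differ.
MODELS = {"2B": 2e9, "4B": 4e9}

for name, params in MODELS.items():
    for bits in (4, 8):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions a 2B model needs about 1 GB for weights at 4-bit and about 2 GB at 8-bit, which lines up with the range quoted in the talk.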

The Gallery app as Google’s playground for builders

The Gallery app is the center of the demo strategy: a local playground meant to show what on-device Gemma can do and give developers sample code they can fork. Parikh emphasizes that it’s not just a polished demo — the app and many of the skills are open source, so the point is to inspire people to remix them into their own products.

The actual demos: Wikipedia, mood tracking, photos, and sound

He runs through several examples that feel more like tiny agents than chat prompts: a Wikipedia-querying skill, a journal tool that summarizes sleep and mood trends from entries like “I only got 8 hours of sleep and I’m looking forward to heading out with Amy today,” and a multimodal flow where a breakfast photo gets turned into a matching music vibe. He also shows a more complex sound-generation workflow and notes users can switch between CPU and GPU now, with NPU support coming.
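Skills like these plausibly rest on function calling with structured JSON output: the model emits a JSON object naming a tool and its arguments, and the app runs that tool locally. The following is a minimal sketch of the dispatch step only. The JSON shape (`name`, `arguments`), the `query_wikipedia` and `log_journal` stubs, and the `SKILLS` registry are hypothetical and are not taken from the Gallery app's code; the model output is a hard-coded string standing in for a real generation.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical local skills; real ones would call an API or a local store.
def query_wikipedia(topic: str) -> str:
    return f"(stub) summary for {topic}"


def log_journal(entry: str) -> str:
    return f"(stub) stored {len(entry)} characters"


SKILLS: Dict[str, Callable[..., str]] = {
    "query_wikipedia": query_wikipedia,
    "log_journal": log_journal,
}


def dispatch(model_output: str) -> str:
    """Parse a JSON function call and run the matching local skill."""
    try:
        call: Any = json.loads(model_output)
    except json.JSONDecodeError as exc:
        return f"error: invalid JSON ({exc.msg})"
    if not isinstance(call, dict):
        return "error: expected a JSON object"
    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in SKILLS:
        return f"error: unknown skill {name!r}"
    if not isinstance(args, dict):
        return "error: arguments must be an object"
    try:
        return SKILLS[name](**args)
    except TypeError as exc:
        # Raised when the argument names do not match the skill's signature.
        return f"error: bad arguments ({exc})"


# Stand-in for text a model might emit in structured-output mode.
sample = '{"name": "query_wikipedia", "arguments": {"topic": "Raspberry Pi"}}'
print(dispatch(sample))
```

Returning an error string instead of raising keeps the loop simple: the error can be fed back to the model as the tool result so it can retry with a corrected call.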

LiteRT is the runtime underneath all of this

Parikh then pivots from flashy demos to plumbing: LiteRT is Google’s on-device framework, built on TensorFlow Lite and renamed to signal that it also supports PyTorch and JAX models. His pitch is portability: developers convert models into the TF Lite format and keep one cross-platform deployment path across Android, iOS, macOS, Linux, Windows, web, and IoT.

Tooling for real deployment, not just one-device demos

This is where he gets into the stack: LiteRT Torch for conversion, LiteRT-LM for language models, Model Explorer for inspecting graphs and tuning quantization, and AI Edge Portal for cloud-based benchmarking across Android device fleets. The underlying concern he calls out is one every mobile team shares: not whether a model runs once, but whether it also runs reliably on five-year-old phones.

The Raspberry Pi robot and the performance claims

Near the end, he shows a tiny robot in the demo booth running LiteRT-LM on a Raspberry Pi CPU; after some blinking lights and a noticeable pause, its Sharpie antennas wiggle in response to a prompt. He laughs that they built it that morning and still need to improve performance, then uses it to underline the bigger point. Google says NPU acceleration can bring 3–10x gains, with some cases hitting 13x; iOS reaches around 56 tokens per second; and the runtime can be up to 35x faster than llama.cpp on mobile.
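To put the throughput figure in user-facing terms, a rough conversion from tokens per second to response time helps. This is illustrative arithmetic only: it uses the 56 tokens-per-second figure from the talk, assumes a steady decode rate, and ignores prompt processing time. The 200-token reply length and the 10-second baseline are arbitrary examples, not measurements.

```python
def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def with_speedup(baseline_seconds: float, factor: float) -> float:
    """Latency after an N-times speedup of the same workload."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return baseline_seconds / factor


# 200 tokens is an arbitrary reply length chosen for illustration.
print(f"200 tokens at 56 tok/s: {decode_seconds(200, 56.0):.1f} s")

# A hypothetical 10 s baseline under the low and high ends of the 3-10x claim.
for factor in (3, 10):
    print(f"10 s baseline at {factor}x: {with_speedup(10.0, factor):.1f} s")
```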

Q&A: home cameras, distributed agents, and bring-your-own speech models

The audience pulls the talk back to real use cases: local face recognition for home security cameras, multi-node setups where small models escalate events to a higher-level agent, and swapping cloud audio-to-text APIs for open-weight on-device models. Parikh’s answers are pragmatic rather than polished — yes, local recognition is possible, Raspberry Pi could do it, distributed classifier-plus-agent setups are common, and if developers have the right open model and file format, Google’s stack should support it.
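The classifier-plus-agent pattern from the Q&A can be sketched as follows. This is one hypothetical arrangement, not something shown in the talk: each node runs a small local classifier, and only confident, relevant events are forwarded to a higher-level agent. The `Event` fields, the 0.8 threshold, and the `HigherLevelAgent` class are assumptions for illustration, and the classifier scores are hard-coded rather than produced by a model.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Event:
    node_id: str
    label: str         # e.g. "person", "pet", "nothing"
    confidence: float  # score from the node's small local model


@dataclass
class HigherLevelAgent:
    """Stand-in for a larger model that reasons over escalated events."""
    inbox: List[Event] = field(default_factory=list)

    def receive(self, event: Event) -> None:
        self.inbox.append(event)


def escalate(
    events: List[Event],
    agent: HigherLevelAgent,
    labels_of_interest: FrozenSet[str] = frozenset({"person"}),
    threshold: float = 0.8,
) -> int:
    """Forward only confident, relevant events; return how many were sent."""
    sent = 0
    for event in events:
        if event.label in labels_of_interest and event.confidence >= threshold:
            agent.receive(event)
            sent += 1
    return sent


events = [
    Event("front-door", "person", 0.93),
    Event("garage", "pet", 0.88),
    Event("backyard", "person", 0.41),
]
agent = HigherLevelAgent()
print(escalate(events, agent))  # only the front-door event qualifies
```

Filtering at the node keeps most traffic local, so the larger agent only spends compute on events the small models could not dismiss.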
