AI Engineer · 19m

AI on Android: Ask me Anything — Florina Muntenescu & Oli Gaymond, Google DeepMind

TL;DR

  • Android now has a clear three-way AI stack: on-device, hybrid, and cloud — Florina Muntenescu frames the decision simply: use Gemini Nano via ML Kit GenAI APIs when available, fall back to Firebase AI Logic for hybrid inference, or go fully cloud with Gemini Flash/Pro when you need more reach or power.

  • AI Core is Google's answer to 'how do 100 apps share a 3–4 GB model without chaos?' — Oli Gaymond says Gemini Nano is delivered as a system service so apps share one optimized model, while Android handles isolation, scheduling, queuing, and foreground priority instead of every developer shipping their own giant model.

  • On-device AI is being positioned around privacy, offline use, and zero inference cost — the pitch is straightforward: prompts stay on the phone, sensitive data like banking info doesn't leave the device, and common use cases include translation, personalization, summarization, proofreading, and image understanding.

  • The biggest practical constraint today is hardware — the GenAI APIs currently target recent flagship-class devices like the Pixel 9/10 generation and similar OEM phones, while older or broader Android coverage still leans on cloud inference or custom deployment through LiteRT LM.

  • Battery and RAM aren't ignored, but Google thinks the real-world pattern is manageable — Gaymond says interactive usage like 10–20 AI requests per day isn't a major battery concern, while heavier batch jobs can be deferred to background or overnight charging, with AI Core doing the platform-level optimization.

  • RAG-style Android apps are possible now, and embeddings are 'soon' — when asked about vector search and note similarity, the team said developers can already build retrieval-like flows with the Prompt API, and an embedding API based on Gemini embeddings is planned to make that much easier.

The Breakdown

A fast reset on what 'AI on Android' actually means

Florina Muntenescu opens by trying to calibrate the room: who already knows how to build intelligent experiences on Android using on-device, hybrid, or cloud inference? When only a few hands go up, she gives a rapid-fire map of the stack: on-device for privacy, offline support, and no per-call cost; hybrid when local isn't available; and cloud when you need maximum capability.

Gemini Nano through ML Kit, with AI Core doing the heavy lifting

The on-device path they focus on is ML Kit GenAI APIs, which expose Gemini Nano, Google's efficient Android-optimized model. The key architectural point is AI Core: one system-level model shared across apps, optimized for device hardware so developers can think more like cloud users — focus on the feature and prompt, not packaging models, provisioning inference, or tuning silicon.

Prompt API is the star, and it's broader than just chat

Florina says ML Kit includes task-specific APIs like summarization, proofreading, and rewrite, but the Prompt API is the most flexible. Right now it supports text and image input with text output, which lets developers build image understanding, content assistance, content analysis, and entity extraction without inventing a whole stack from scratch.

Hybrid inference is Google's bridge for Android fragmentation

Because Gemini Nano currently lands on recent flagship devices — Florina cites Pixel 9, Pixel 10, and similar generations from other OEMs — Google is pushing Firebase AI Logic as the reach-expander. Their new hybrid inference setup lets apps use Gemini Nano locally when available and switch to cloud models like Gemini Flash otherwise, aiming for a more consistent user experience across the messy Android device landscape.
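
The local-first, cloud-otherwise routing described above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the Firebase AI Logic API: `HybridRouter`, `local_generate`, and `cloud_generate` are hypothetical names, and treating a `RuntimeError` as "local model unavailable" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class HybridRouter:
    """Prefer an on-device model when present, otherwise call a cloud model.

    Every name here is hypothetical; this is not the Firebase AI Logic API.
    """

    # None models a device that has no on-device model at all.
    local_generate: Optional[Callable[[str], str]]
    cloud_generate: Callable[[str], str]

    def generate(self, prompt: str) -> Tuple[str, str]:
        """Return (source, text), where source is 'on-device' or 'cloud'."""
        if self.local_generate is not None:
            try:
                return "on-device", self.local_generate(prompt)
            except RuntimeError:
                # Assumed failure mode: local model busy or not yet downloaded.
                pass
        return "cloud", self.cloud_generate(prompt)


if __name__ == "__main__":
    router = HybridRouter(
        local_generate=None,  # pretend this device is not supported
        cloud_generate=lambda p: "cloud answer for: " + p,
    )
    print(router.generate("Summarize my notes"))
```

Returning the source alongside the text is a design choice for the sketch: it lets the app tell the user, or its own logs, whether a prompt left the device.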

The first audience question goes straight to battery and RAM

Someone asks the practical question everybody wants answered: what does Gemini Nano or LiteRT LM do to memory and battery? Oli is candid — yes, nonstop local inference will drain battery — but says the real usage pattern they're seeing is short, user-triggered requests maybe 10 to 20 times a day, which hasn't been alarming, while bulk processing can be pushed to background or overnight charging.

Why Google wants the model in the system, not bundled in every app

A developer asks what happens when, in two years, 100 apps all hit the same shared on-device model. Oli's answer is basically the whole thesis of AI Core: the smallest useful models are around 1 GB and the shipped setup is closer to 3–4 GB total, so centralizing that cost at the OS level is the only sane way to scale. With the model in the system, Android can queue requests, prioritize the foreground app, and attribute battery impact the way it already does for GPS or Wi-Fi.
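
To put numbers on the sharing argument: if each of 100 apps bundled its own 3–4 GB model, that would be 300–400 GB of model data on one phone, versus 3–4 GB for a single shared copy. The queuing and foreground-priority behavior can be pictured with a toy scheduler. This is only an illustration of the idea in Python; `SharedModelQueue` and its methods are hypothetical, and AI Core's real scheduling policy is not described in the talk beyond "queue requests, prioritize the foreground app".

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class SharedModelQueue:
    """Toy scheduler: one model, many apps, foreground requests served first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, str]] = []
        # Monotonic counter used as a tie-breaker so requests with the
        # same priority are served in arrival (FIFO) order.
        self._order = itertools.count()

    def submit(self, app: str, prompt: str, foreground: bool) -> None:
        priority = 0 if foreground else 1  # lower number is served first
        heapq.heappush(self._heap, (priority, next(self._order), app, prompt))

    def next_request(self) -> Optional[Tuple[str, str]]:
        """Pop the next (app, prompt) to run, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, app, prompt = heapq.heappop(self._heap)
        return app, prompt


if __name__ == "__main__":
    queue = SharedModelQueue()
    queue.submit("mail", "summarize inbox", foreground=False)
    queue.submit("notes", "proofread draft", foreground=True)
    print(queue.next_request())  # the foreground app's request comes out first
```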

What this is not: it's not the Gemini app, and it's not 'skills' out of the box

Another attendee asks whether default Android assistant behavior is local or remote, and whether they can install this stack and improve answers with 'skills.' The speakers draw a clean line: consumer Google experiences like Gemini app integrations are a different user journey, while these APIs are lower-level building blocks for developers to create their own app experiences, potentially with tools like PocketPal/OpenClaw composing prompt logic on top.

RAG, embeddings, and the split between easy mode and frontier mode

In the closing minutes, the conversation turns to RAG-like apps, image inputs, and device/model compatibility. Oli says the Prompt API already supports text-plus-image input for things like summarizing your photos into notes, and that embeddings are coming soon. AI Edge Gallery is there to show the frontier, while AI Core is the production-friendly path: Google's stated commitment is that if the API is available on a device, it should run well.
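
For readers unfamiliar with why an embedding API makes note similarity easier, here is a minimal sketch of the retrieval step in Python. It assumes each note has already been turned into a vector by some embedding model; the vectors below are made up for illustration, and `cosine_similarity` and `top_k_notes` are hypothetical helper names, not part of any Android or ML Kit API.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_notes(
    query_vec: Sequence[float],
    note_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[str]:
    """Return the names of the k notes most similar to the query vector."""
    ranked = sorted(
        note_vecs,
        key=lambda name: cosine_similarity(query_vec, note_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Made-up 2-dimensional vectors standing in for real embeddings.
    notes = {
        "groceries": [1.0, 0.0],
        "recipes": [0.9, 0.1],
        "taxes": [0.0, 1.0],
    }
    print(top_k_notes([1.0, 0.05], notes, k=2))
```

In a RAG-style flow, the text of the selected notes would then be placed into the prompt sent to the model.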
