AI Engineer · 1h 47m

Building Conversational Agents — Thor Schaeff and Philipp Schmid, Google DeepMind

TL;DR

  • Google is steering developers toward the new Interactions API — Philipp Schmid framed it as the successor to generateContent, with server-side state, built-in agents like Deep Research, SSE streaming, remote MCP support, and async/background execution designed for longer-running agent workflows.

  • Server-side state is the practical win, not just cleaner syntax — by passing previous_interaction_id instead of replaying the full chat history, developers get simpler multi-turn agents and much better implicit caching; Schmid said startup users are seeing 2–3x higher cache hit rates, with input token costs roughly 90% cheaper on cache hits.

  • Their coding-agent demo was intentionally built with AI coding agents, not by hand — using Google’s new 'skills' for tools like Cursor, Gemini CLI, and Antigravity, Schmid had Gemini 3 Flash generate a Python agent that could chat, read/write files, and run bash commands, including creating a thumbs-up SVG and writing files to disk.

  • The Live API pitch is native audio, multimodality, and interruption-friendly conversation — Thor Schaeff described Gemini 3.1 Flash Live as a stateful WebSocket model that can ingest text, audio, and video (up to 1 frame/sec), handle 97 languages in preview, support barge-in, and use tools like Google Search grounding and custom function calls.

  • The most memorable demo was a voice DJ that turns conversation into songs — Schaeff combined Gemini Live with the Lyria 3 music model to generate things like a 'German techno-schlager about the UK AI scene,' complete with British-radio-host banter and a crowd-pleasing line about TPUs enjoying applause.

  • The workshop also made the current limits very visible — live Google Search grounding repeatedly failed during the weather demo, and both speakers were candid that while companies like Shopify, Waymo, Stitch, and Argentina-based Hey Ado are using this tech, native-audio voice apps still have rough edges around observability, transcripts, enterprise controls, and production reliability.

The Breakdown

A multilingual room, a secret API key, and the workshop kickoff

Thor Schaeff and Philipp Schmid open with a playful riff about being German, scan the room for languages, and quickly turn the crowd’s diversity into a live stress test for Gemini. The first practical move is getting everyone onto ai.dev to create a Gemini API key — free tier, no credit card — with the recurring reminder that an API key is in fact secret, unlike the one Philipp briefly exposes on screen and promises to delete.

Why Google wants developers on Interactions API now

Schmid lays out the bigger shift: Interactions API, launched in beta in December, is meant to replace generateContent with something more agent-friendly and more familiar to developers coming from OpenAI or Anthropic. His big pitch is a unified surface for models and agents — the same interface can call Gemini 3 Flash, Deep Research, image generation, audio, and eventually video — without the old proto-heavy Google feel.
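
The exact client surface of the beta API isn’t shown in the talk, so as a purely illustrative sketch of the "unified surface" idea, the same request shape could target either a raw model or a built-in agent — only the target name changes (field names below are assumptions, not the published schema):

```python
# Hypothetical request builder illustrating one interface for models
# and agents alike. "target", "input", and "stream" are illustrative
# field names, not the Interactions API's published schema.

def build_interaction_request(target: str, prompt: str) -> dict:
    """Same shape whether `target` is a model or a built-in agent."""
    return {
        "target": target,  # e.g. "gemini-3-flash" or "deep-research"
        "input": [{"role": "user", "content": prompt}],
        "stream": True,    # SSE streaming, per the talk
    }

model_req = build_interaction_request("gemini-3-flash", "Summarize this repo.")
agent_req = build_interaction_request("deep-research", "Survey voice agent stacks.")
```

The point of the sketch is the symmetry: swapping a model for Deep Research is a one-string change, not a different endpoint or client.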

The real advantage: server-side memory and better caching

The most useful technical point is state management on the server: instead of sending the whole conversation every turn, you pass previous_interaction_id and keep going. Schmid says that’s not just ergonomics; because the server preserves exact context, implicit caching survives small prompt edits that would normally break it, and teams using the API are seeing 2–3x better cache rates, with cached input tokens around 90% cheaper.
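
To make the mechanism concrete, here is a minimal local simulation of server-side interaction state — a stand-in for what Google’s servers do, not the real API. The client only ever sends the newest turn plus previous_interaction_id; the "server" reconstructs the rest, which is why cached prefixes stay byte-identical and implicit caching survives:

```python
import itertools

class FakeInteractionStore:
    """Local stand-in for server-side state: each interaction id maps to
    the full reconstructed context, so clients never replay history."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._history = {}  # interaction_id -> list of turns

    def create(self, user_input: str, previous_interaction_id=None) -> dict:
        # The server, not the client, supplies the prior turns.
        prior = self._history.get(previous_interaction_id, [])
        turns = prior + [{"role": "user", "content": user_input}]
        interaction_id = f"int_{next(self._ids)}"
        self._history[interaction_id] = turns
        return {"id": interaction_id, "context_turns": len(turns)}

store = FakeInteractionStore()
first = store.create("What is the Live API?")
second = store.create("Does it support barge-in?",
                      previous_interaction_id=first["id"])
```

Because the prefix is preserved server-side verbatim, a small edit to the newest turn never perturbs earlier tokens — which is exactly the property that client-side history replay tends to break.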

Vibe-coding a coding agent with Gemini ‘skills’

Rather than hand-code, Schmid leans into the moment: use your preferred coding agent and install Google’s Gemini skills package, which works with Cursor, Gemini CLI, and Antigravity. He explains the design philosophy well — skills should teach the model things it won’t reliably know, like current Gemini model names or API usage patterns, while linking out to live docs instead of stuffing stale documentation into the prompt.

Building the agent: chat, files, bash, and a thumbs-up SVG

The demo escalates fast: Gemini 3 Flash generates a Python Agent class, a run method using interactions.create, and then file tools with JSON schemas for read_file and write_file. Once the loop is wired up, the agent writes hello_agent.txt, reads it back, and later creates a thumbs-up SVG — after Schmid notices the model needed stronger system instructions like 'you are an expert software engineer' before it reliably acted like a coding agent.
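
The file tools from the demo can be sketched as JSON-schema declarations plus a local dispatcher that executes whatever call the model emits. The schema layout and dispatcher are modeled on common function-calling APIs, not the exact generated code:

```python
import pathlib

# Tool declarations the agent would hand to the model. The schema shape
# follows common function-calling conventions; it is an assumption, not
# the demo's literal output.
TOOLS = [
    {
        "name": "read_file",
        "description": "Return the contents of a text file.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "write_file",
        "description": "Write text to a file, creating it if needed.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
]

def dispatch(call: dict) -> str:
    """Execute one tool call emitted by the model; return the result
    that would be fed back into the next interactions turn."""
    name, args = call["name"], call["args"]
    if name == "write_file":
        pathlib.Path(args["path"]).write_text(args["content"])
        return f"wrote {len(args['content'])} chars to {args['path']}"
    if name == "read_file":
        return pathlib.Path(args["path"]).read_text()
    raise ValueError(f"unknown tool: {name}")

# Replaying the demo's first two steps locally:
dispatch({"name": "write_file",
          "args": {"path": "hello_agent.txt", "content": "hi"}})
print(dispatch({"name": "read_file", "args": {"path": "hello_agent.txt"}}))  # hi
```

The loop itself is then just: send the tool results back via interactions.create, referencing the previous interaction id, until the model stops emitting calls.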

The Live API handoff: native audio instead of text-stitching

After the break, Schaeff takes over with Gemini 3.1 Flash Live, emphasizing that this is a native audio model, not a pipeline of speech-to-text, LLM, and text-to-speech. He highlights the key features in a very builder-friendly way: stateful WebSockets, real-time audio/video input, multilingual support across 97 languages in preview, interruption handling, and support for built-in Google Search grounding plus custom tools.
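
On the client side, interruption handling usually boils down to one rule: the instant the user barges in, flush every queued audio chunk so the model doesn’t keep talking over them. A minimal local sketch of that logic — the real Live API signals interruption over its WebSocket; the class and method names here are assumptions:

```python
from collections import deque

class PlaybackBuffer:
    """Client-side audio queue supporting barge-in: when the server
    signals that the user interrupted, all unplayed chunks are dropped."""

    def __init__(self):
        self._chunks = deque()

    def enqueue(self, chunk: bytes):
        self._chunks.append(chunk)

    def on_interrupted(self) -> int:
        """Drop pending audio; return how many chunks were flushed."""
        flushed = len(self._chunks)
        self._chunks.clear()
        return flushed

    def next_chunk(self):
        return self._chunks.popleft() if self._chunks else None

buf = PlaybackBuffer()
buf.enqueue(b"\x00" * 320)   # queued model speech
buf.enqueue(b"\x00" * 320)
buf.on_interrupted()         # user barged in: both chunks discarded
```

Getting this flush right is most of what makes native-audio conversation feel natural — leftover buffered speech is what makes a voice agent seem like it isn't listening.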

The standout demo: a bilingual radio DJ that generates songs

The room’s most fun moment is 'Live Jukebox DJ,' a voice app built in Google AI Studio that uses Gemini Live as the host and Lyria 3 as the music-generation tool. Schaeff plays the BBC-radio-presenter role to the hilt while the system happily turns a request for a 'high-energy German techno-schlager about the AI scene in the UK' into an actual song, later fielding a Swahili nursing-techno request with the same chaotic enthusiasm.

Great ambitions, very real demo gods, and honest production caveats

The ending is refreshingly candid: Schaeff tries to show Irish-accent prompting, client-to-server and ephemeral-token setups, and live Google Search grounding for London weather, but the grounding repeatedly fails, and the custom tools work intermittently before failing as well. In Q&A, both speakers acknowledge the tradeoffs — native-audio apps feel magical, but today serious business deployments may still prefer cascading pipelines for observability, transcript retention, compliance, and tighter control, even as companies like Shopify, Waymo, and Hey Ado push the category forward.