
Prince Canuma’s pitch for on-device AI is deeply personal — he started building toward local inference after his father lost his sight in 2020 and needed accessible tools that would work in Africa, where internet access and subscriptions were unreliable.
MLX has grown from an Apple-silicon array framework into a serious deployment layer — Canuma says it now has 1.5 million downloads, 4,000+ model ports, and day-zero support for frontier open models like Gemma 4 on Mac, iPhone, and iPad.
The talk’s core claim is that local models are now much bigger and more practical than people assume: he says MLX can run models with hundreds of billions of parameters on-device, and that Gemma 4 26B runs on an iPhone, using storage, at “reasonable speeds.”
Audio is the big unlock for real agents, not just chatbots — beyond vision, Neywa Labs built MLX Audio with speech-to-text, text-to-speech, speech-to-speech, and a sub-100ms custom model called Marvis so you can talk to your computer in a more “Jarvis” style.
The demos were all about real-time, offline capability — Canuma showed Roboflow’s RF-DETR object detection and background segmentation running live on his Mac with the internet turned off, then used MLX VLM to describe an image locally on a 96GB machine.
The roadmap points toward hybrid inference, robotics, and much longer local context — in Q&A, he said MLX currently uses the GPU rather than Apple’s Neural Engine, but he’s experimenting with Core ML hybrids and says Turbo Quant enables up to 1 million token context on-device with roughly 4x KV-cache reduction.
Canuma opens by asking who has tried running AI on a phone or MacBook, then immediately frames the talk as a way to “offload some of that subscription completely on device” so the only bill left is electricity. The emotional center lands fast: in 2020 his father became blind just as Apple shipped the M1, and Canuma made a promise to somehow “get you back to reading.” For him, cloud AI wasn’t enough because his dad lived in Africa, where connectivity and subscription economics made local-first systems feel necessary, not trendy.
He describes finding MLX on GitHub in 2023 and seeing it as the Apple-silicon equivalent of PyTorch or TensorFlow. What excited him was the sense that Apple was building for on-device intelligence while Meta and Google were still optimizing mostly for cloud scale. He has been contributing for a few years, and says MLX has since reached 1.5 million downloads, 4,000-plus model ports, and day-zero support for releases like Gemma 4.
The first big use case is vision, because blindness was the “biggest sensory deprivation” for his dad. Canuma recalls building hackathon goggles that narrated the world, then positions MLX VLM as the cleaner second version: just pull out an iPhone, point the camera, and ask what’s in front of you. He also notes the shift from simple vision models to omni models that can take image, audio, and text, which matters because for his dad, typing isn’t practical but speaking is.
Canuma says audio became the next frontier partly for accessibility and partly for a “very selfish reason”: he wanted to control his computer without sitting in front of it. That led from text-to-speech to Marvis, their custom model that generates audio in under 100 milliseconds, plus speech-to-text and speech-to-speech support in both Python and Swift. His point is that MLX isn’t just a model runner — it lets developers compose modular pipelines across ASR, LLM, and TTS depending on the hardware budget, from an old M1 to the latest Apple silicon.
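The modular composition described above can be sketched as three swappable stages. This is a minimal illustration, assuming each stage is an ordinary Python callable: the names `VoicePipeline`, `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical stand-ins and are not part of MLX, MLX Audio, or Marvis.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures for illustration; none of these are MLX APIs.
ASR = Callable[[bytes], str]   # audio in -> transcript
LLM = Callable[[str], str]     # transcript -> reply text
TTS = Callable[[str], bytes]   # reply text -> audio out


@dataclass
class VoicePipeline:
    """Chains speech-to-text, a language model, and text-to-speech."""
    asr: ASR
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        transcript = self.asr(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Placeholder stages so the sketch runs without any model weights.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(asr=fake_asr, llm=fake_llm, tts=fake_tts)
print(pipeline.respond(b"open my calendar"))
```

Because each stage is just a callable, a smaller model could be substituted at any one position on older hardware without touching the other two, which is the kind of budget-driven swapping the talk describes.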
The vision demos are straightforward but effective: real-time object detection with Roboflow’s RF-DETR, background blur with mask detection, and repeated reminders that everything is happening locally, even with the internet switched off. He jokes that the detector labels his glass of water as wine — “but I’m sober” — which gives the whole thing a lived-in vibe rather than a polished benchmark reel. He then loads Gemma 4 in MLX VLM’s chat UI and runs image analysis locally on a machine with 96GB of memory, stressing that all the showcased models can run simultaneously in real time on that setup.
When an audio demo misbehaves, he pivots to community examples: grounded visual reasoning for home security and dash-cam review, on-device cartoon video generation via MLX Video, and a native app called Locally that uses MLX Audio and Marvis TTS to speak back to users. The most forward-looking example is a Reachy Mini robot powered by MLX vision and audio, including real-time voice cloning in a Jarvis-like style. His closing message is simple and ambitious: today, you can build agents that hear, see, and sound like you or someone you love, running fully on an iPhone, iPad, Mac, or robot.
In questions, Canuma explains that MLX currently runs on the GPU, not Apple’s Neural Engine, because using the Neural Engine cleanly requires Core ML and Apple’s APIs still make that hard. He expects WWDC could change that and hints at internal work on hybrid inference across both. He also recommends the Mac utility “MacTop” for real-time GPU monitoring, names Gemma 4’s omni variants and Qwen 3 Omni as current multimodal picks, and says one of the biggest recent breakthroughs is Turbo Quant: he implemented it publicly within 30 minutes of the paper, and claims it can cut KV-cache memory by 4x while enabling up to 1 million token context on-device.
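To see why a 4x KV-cache reduction matters at a 1 million token context, here is a rough back-of-the-envelope calculation. It is a sketch under stated assumptions: the layer count, KV-head count, and head dimension below are hypothetical and do not describe any specific model, and the 4x factor is simply the figure quoted in the talk, applied as a divisor rather than derived from how Turbo Quant works.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int) -> int:
    """Size of a standard attention KV cache for one sequence."""
    # The factor of 2 covers storing both keys and values at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT_TOKENS = 1_000_000

# 16-bit cache entries take 2 bytes each.
baseline_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS, 2)

# Apply the roughly 4x reduction claimed in the talk.
compressed_bytes = baseline_bytes // 4

print(f"16-bit KV cache: {baseline_bytes / 1e9:.1f} GB")    # 131.1 GB
print(f"after 4x cut:    {compressed_bytes / 1e9:.1f} GB")  # 32.8 GB
```

Under these assumed dimensions, the uncompressed cache alone would exceed the 96GB machine used in the demo, while the reduced cache would leave room for model weights alongside it.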