AI Engineer · June 9, 2026 · 20m

GPU Cloud Deployment Without Leaving Your IDE — Audry Hsu, RunPod

TL;DR

Flash turns a Python function into a GPU endpoint: Audrey shows RunPod's Flash SDK wrapping an async Python function with a decorator so the GPU-heavy part runs in the cloud while the rest stays local.
The pitch is speed of iteration, not just cheap compute: Instead of committing, pushing to GitHub, building a Docker image, pulling it from a registry, and provisioning a GPU for every change, Flash hot-reloads file changes straight from the IDE.
RunPod has grown fast from a Reddit post to real scale: The company started in 2022 after founders Zhen Lu and Pardeep Singh repurposed spare crypto-mining GPUs, and Audrey says it now spans more than 30 data centers in 10 countries with $120 million in ARR.
The live demo makes the case with model swapping in real time: Stable Diffusion XL Turbo produces ugly “abstract cats” for a London sky prompt, then Audrey comments out the SDXL Turbo code and switches to DreamShaper for a clearly better image.
Serverless is positioned for bursty, large-scale inference: Audrey explains that pods give you reserved GPUs, while serverless adds autoscaling and bills by request duration, with an H100 example priced at $0.00116 per second (about $4.18 per hour of active compute).
The bigger win is orchestration across models: Her final pipeline uses Qwen 3 to rewrite prompts, DreamShaper to generate images, and Nano Banana 2 to compose founder photos, showing Flash as a tool for stitching together multi-step AI workflows.
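
The decorator pattern described in the first point can be sketched in plain Python. This is a minimal illustration only: `gpu_endpoint` and `generate_image` are hypothetical names, not the Flash SDK's API, and the "remote" call here simply runs locally so the example works without a cloud account.

```python
import asyncio
import functools


# Hypothetical stand-in for a "run this on a remote GPU" decorator.
# It is NOT the Flash SDK API; the wrapped function runs locally here.
def gpu_endpoint(gpu: str):
    def decorate(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            # A real SDK would ship the call to a GPU worker at this point.
            result = await fn(*args, **kwargs)
            return {"gpu": gpu, "output": result}
        return wrapper
    return decorate


@gpu_endpoint(gpu="H100")
async def generate_image(prompt: str) -> str:
    # Placeholder for the GPU-heavy part, such as a diffusion model call.
    return f"<image for: {prompt}>"


async def main() -> dict:
    # Everything outside the decorated function stays ordinary local Python.
    prompt = "cats flying over London"
    return await generate_image(prompt)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The point of the pattern is that the call site stays an ordinary `await`, so swapping the model inside the decorated function does not change the surrounding local code.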

The Breakdown

RunPod says you can swap image models, push code to a GPU cloud, and test the result without ever leaving your IDE, cutting out the usual commit-build-Docker-deploy loop. In a live demo, Audrey Hsu goes from a hilariously bad “cats flying in London” image to a much better result, then chains Qwen 3, DreamShaper, and Google’s Nano Banana 2 into a multi-model pipeline.
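
The three-stage chain from the demo can be sketched as a sequence of awaited calls. The stage functions below are hypothetical stubs that return placeholder data instead of calling Qwen 3, DreamShaper, or Nano Banana 2; the sketch only shows how each stage's output feeds the next.

```python
import asyncio


# Hypothetical stubs: each one stands in for a single model call.
async def rewrite_prompt(prompt: str) -> str:
    # Stage 1: a language model (Qwen 3 in the talk) expands the prompt.
    return f"{prompt}, detailed, cinematic lighting"


async def generate_image(prompt: str) -> dict:
    # Stage 2: an image model (DreamShaper in the talk) renders the prompt.
    return {"model": "image-model-stub", "prompt": prompt}


async def compose(image: dict, photos: list) -> dict:
    # Stage 3: a compositing model (Nano Banana 2 in the talk) merges inputs.
    return {"background": image, "subjects": photos}


async def pipeline(prompt: str, photos: list) -> dict:
    # Each stage awaits the previous one, so the steps run strictly in order.
    better_prompt = await rewrite_prompt(prompt)
    image = await generate_image(better_prompt)
    return await compose(image, photos)


if __name__ == "__main__":
    print(asyncio.run(pipeline("cats flying over London", ["a.png", "b.png"])))
```

In a real deployment each stub would be a separately deployed GPU function, and the orchestration code could keep this same shape.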