Shipping complex AI applications — Braintrust & Trainline
TL;DR
The hard part isn’t the model — it’s operational rigor — Jirean frames the whole workshop around the gap between a flashy GenAI POC and a production system, arguing that most failures come from weak workflows around tracing, evaluation, and iteration, not from GPT-5 mini or other models being “not smart enough.”
Trainline is running agentic AI at real consumer scale — Usama and Mayan say Trainline serves 27 million active users and roughly £6.3 billion in train-ticket sales, with a live multi-agent travel assistant that can handle refunds, train changes, and human handoff rather than acting as a simple chatbot.
Trainline uses Braintrust to manage both quality and cost — Mayan explains that switching between OpenAI and Anthropic models to control spend was risky before they had offline and online evals, but Braintrust now lets them simulate lower-cost model swaps and verify that user experience holds up.
The workshop’s core idea is to decompose AI apps like software systems — Instead of one giant prompt, the demo support-triage app is split into stages for context collection, triage, policy review, reply writing, and escalation, which Jirean compares to breaking a monolith into microservices.
Tracing down to tool-call and token level is presented as table stakes — Braintrust is used to capture nested traces, latency, prompt and completion tokens, metadata, and waterfall timelines so teams can see not just what happened in logs, but how a multi-stage agent behaved step by step.
The production flywheel is evaluate, ship, observe, remediate, repeat — The team seeds a 10-case golden dataset, runs deterministic and LLM-as-a-judge scores, pushes prompts and tools into Braintrust-managed infra, applies online scoring to live traces, and then fixes a failure case involving a CFO invoice-export ticket that the model wrongly marked as low urgency.
The Breakdown
Why most AI projects stall after the demo
Jirean opens with a familiar pain point: loads of teams can build a cool GenAI demo, but far fewer can “industrialize” it. His point is blunt and useful — traditional software is deterministic, but LLM systems aren’t, so treating logs as if they’re enough is a category error; you need observability into behavior, not just a record of what happened.
Braintrust and Trainline set the stakes
He introduces Braintrust as a nearly three-year-old Series B company that recently raised $80 million at an $800 million valuation, backed by firms like Greylock and Andreessen Horowitz. Then Trainline’s Usama and Mayan make the workshop feel real: Trainline is not just selling tickets, but operating AI systems at scale across fragmented rail networks where there’s no single global equivalent of the airline stack.
Trainline’s AI systems are already in production, not in theory
Usama walks through Trainline’s two AI worlds: classic ML models, such as disruption prediction (“like a weather app for rail”), and newer agentic systems. The standout example is the travel assistant, which can proactively surface alternative trains, process refunds, and hand off to human support — all while serving 27 million active users and a huge live stream of conversations.
Quality at Trainline means combining software discipline with ML evals
This is where their framing gets sharp: deterministic systems on one side, non-deterministic ML on the other, and agentic systems sitting awkwardly in the middle. Mayan says model cost pressure forced them to keep switching providers and model sizes, and Braintrust became the way to test whether a cheaper model could replace a stronger one without quietly tanking the experience.
The demo app: from one-shot prompt to staged agent
Jirean’s hands-on example is a fake support-triage agent, but it mirrors the shape of lots of real enterprise workflows. He starts with a single LLM call, then adds local tools, then breaks the flow into specialist stages — collect context, triage, review policy, draft the customer reply, and decide whether to escalate — explicitly comparing the move to splitting a monolith into microservices.
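The staged decomposition can be sketched as a simple pipeline where each specialist stage is a small function over shared ticket state. The stage names come from the talk; everything else here — function names, the stub logic, the `ticket` dict shape — is an illustrative assumption, not the workshop's actual code (each stub stands in for what would be a focused LLM call):

```python
def collect_context(ticket: dict) -> dict:
    # Stage 1: gather account, order, and conversation history (stubbed).
    ticket["context"] = {"customer_tier": "enterprise"}
    return ticket

def triage(ticket: dict) -> dict:
    # Stage 2: classify urgency. A real version is an LLM call constrained
    # to a schema; this stub keys off a phrase for demonstration.
    text = ticket["message"].lower()
    ticket["urgency"] = "high" if "board meeting" in text else "low"
    return ticket

def review_policy(ticket: dict) -> dict:
    # Stage 3: check the planned action against support policy (stubbed).
    ticket["policy_ok"] = True
    return ticket

def draft_reply(ticket: dict) -> dict:
    # Stage 4: write the customer-facing reply.
    ticket["reply"] = f"Thanks for reaching out. We've marked this as {ticket['urgency']} priority."
    return ticket

def decide_escalation(ticket: dict) -> dict:
    # Stage 5: escalate high-urgency enterprise tickets to a human.
    ticket["escalate"] = (
        ticket["urgency"] == "high"
        and ticket["context"]["customer_tier"] == "enterprise"
    )
    return ticket

STAGES = [collect_context, triage, review_policy, draft_reply, decide_escalation]

def run_agent(message: str) -> dict:
    ticket = {"message": message}
    for stage in STAGES:  # each stage is independently promptable and testable
        ticket = stage(ticket)
    return ticket
```

The payoff of this shape is exactly the microservices analogy: each stage can be evaluated, traced, and swapped independently instead of debugging one giant prompt.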
Observability is where the app stops being a toy
Once the agent works, the workshop shifts to tracing every nested step inside Braintrust. The pitch here is practical rather than abstract: you can inspect parent and child spans, tool calls, token counts, latency, metadata, and timeline waterfalls to debug where an agent is slow, expensive, or wrong.
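The span model being described — nested parent/child spans carrying latency, token counts, and metadata — can be illustrated with a toy tracer. This is a minimal sketch of the concept only, not the Braintrust SDK; all names here are invented for illustration:

```python
import time
from contextlib import contextmanager

# Toy span tracer: each span records its parent, wall-clock latency, and
# arbitrary metadata (e.g. token counts), mimicking a nested agent trace.
TRACE: list = []
_stack: list = []

@contextmanager
def span(name: str, **metadata):
    record = {
        "name": name,
        "parent": _stack[-1] if _stack else None,
        "metadata": metadata,
    }
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield record  # caller may attach more metadata mid-span
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        _stack.pop()
        TRACE.append(record)

# Usage: nest spans the way a multi-stage agent nests tool and LLM calls.
with span("triage") as s:
    with span("llm_call", prompt_tokens=412, completion_tokens=57):
        time.sleep(0.01)  # stand-in for the model call's latency
    s["metadata"]["decision"] = "low_urgency"
```

With records shaped like this, the waterfall view, per-step token accounting, and parent/child drill-down the workshop demonstrates all become straightforward queries over the trace.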
Golden datasets, scoring, and the first production loop
From there, they seed a 10-case dataset into Braintrust and run two kinds of scoring: deterministic checks for things like schema and escalation fields, plus LLM-as-a-judge rubrics for more subjective quality. Jirean’s message is that you can’t ship “on vibes”; even if you don’t have perfect coverage, you need a baseline that says what good looks like.
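A deterministic scorer of the kind described — checking schema and escalation fields against a golden case — might look like the sketch below. Field names and the sample case are illustrative assumptions; the LLM-as-a-judge rubric would sit alongside this as a second, subjective scorer:

```python
REQUIRED_FIELDS = {"category", "urgency", "escalate", "reply"}
VALID_URGENCY = {"low", "medium", "high"}

def schema_score(output: dict) -> float:
    # 1.0 if the output matches the expected schema, else 0.0; no partial credit.
    if not REQUIRED_FIELDS <= output.keys():
        return 0.0
    if output["urgency"] not in VALID_URGENCY:
        return 0.0
    if not isinstance(output["escalate"], bool):
        return 0.0
    return 1.0

def escalation_score(output: dict, expected: dict) -> float:
    # Deterministic check: did the agent escalate when the golden case says so?
    return 1.0 if output.get("escalate") == expected["escalate"] else 0.0

# A tiny golden case in the spirit of the workshop's 10-case dataset.
golden = {
    "input": "Not urgent, but our CFO can't export invoices before the board meeting.",
    "expected": {"escalate": True},
}
output = {"category": "billing", "urgency": "high", "escalate": True, "reply": "On it."}
scores = {
    "schema": schema_score(output),
    "escalation": escalation_score(output, golden["expected"]),
}
```

Even this small baseline replaces "vibes" with a number that can be tracked across prompt and model changes.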
Managed prompts, online scoring, and fixing a real miss
The final leg moves prompts, tools, and parameters into Braintrust-managed infrastructure so non-engineers can safely tweak things like the baseline model without code changes. Then they apply online scoring to live traces and walk through a memorable failure case — a user says “this isn’t urgent” but also says the CFO can’t export invoices before a board meeting, and the model misses the business severity — before tightening the prompt, rerunning evals, and showing the score recover in the UI.
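Online scoring of live traces typically means sampling a fraction of traffic and attaching scores asynchronously. A minimal sketch of that loop, with a deterministic stand-in for the judge that targets the CFO-invoice failure pattern — the sample rate, scorer logic, and trace shape are all assumptions for illustration, not Trainline's configuration:

```python
import random

SAMPLE_RATE = 0.1  # score ~10% of live traces (assumed rate)

def urgency_judge(trace: dict) -> float:
    # Stand-in for an LLM-as-a-judge call: flag traces where the user
    # downplays urgency but mentions a business-critical deadline.
    text = trace["input"].lower()
    downplayed = "isn't urgent" in text or "not urgent" in text
    business_critical = "cfo" in text or "board meeting" in text
    if downplayed and business_critical:
        # The model should escalate despite the "not urgent" phrasing.
        return 1.0 if trace["output"]["urgency"] == "high" else 0.0
    return 1.0

def score_online(traces: list, rng=random.random) -> list:
    # Attach a score to a sampled subset; unsampled traces pass through.
    for trace in traces:
        if rng() < SAMPLE_RATE:
            trace["score"] = urgency_judge(trace)
    return traces
```

Traces that score 0.0 under this check surface exactly the kind of miss the workshop walks through, and become the next golden cases to eval against after tightening the prompt.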