The Production AI Playbook: Deploying Agents at Enterprise Scale — Sandipan Bhaumik, Databricks
TL;DR
Model choice should come late, not first: Bhaumik says most teams start with GPT vs. Claude debates, but in one 8-week banking POC his team picked the model in week seven after building evals, tracing, and data plumbing first.
Three gaps kill production AI: He frames the recurring failures as an observability gap, an evaluation gap, and a governance gap, meaning teams cannot see what the system did, cannot measure what matters, and do not know who owns failures at 3 a.m.
Evaluation needs three layers: Deterministic checks catch formats and PII, semantic checks use LLM-as-a-judge for groundedness and relevance, and behavioral evals catch expensive issues like an agent making three duplicate database calls for one account-balance answer.
Data quality becomes unforgiving with agents: Bhaumik says humans can work around bad data in reports, but agents cannot, because they return wrong answers confidently. That is why he spends about 60% of project time on data foundations.
Tracing turns AI from a black box into an operable system: In a banking dispute flow, traces exposed every step from intent classification to policy retrieval to guardrails, and later helped diagnose a CSAT drop caused by an outdated policy document in the vector database.
Governance includes prompts, models, and incident response: He argues prompt versioning should follow enterprise change management, model updates must be tested against your own eval set, and AI incidents need a playbook of detect, diagnose, contain, and fix tied into ITSM systems.
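To make the three evaluation layers more concrete, here is a minimal Python sketch of the two layers that need no model call: a deterministic check for format and PII, and a behavioral check that flags repeated identical tool calls, such as three duplicate database lookups for one account-balance answer. The function names, the regex patterns, and the (tool, arguments) trace format are all assumptions made for illustration; the talk does not specify an implementation, and the semantic LLM-as-a-judge layer is left out because it depends on whichever judge model a team uses.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; a real deployment would
# rely on a vetted PII detector rather than two regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def deterministic_check(answer: str, max_chars: int = 1200) -> list[str]:
    """Layer 1: cheap rule-based checks on format and PII."""
    problems = []
    if not answer.strip():
        problems.append("empty_answer")
    if len(answer) > max_chars:
        problems.append("too_long")
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(answer):
            problems.append(f"pii:{name}")
    return problems


def behavioral_check(tool_calls: list[tuple[str, str]]) -> list[str]:
    """Layer 3: flag identical (tool, arguments) calls repeated in one request."""
    counts = Counter(tool_calls)
    return [
        f"duplicate_call:{tool}:{n}"
        for (tool, _args), n in counts.items()
        if n > 1
    ]


if __name__ == "__main__":
    print(deterministic_check("You can reach us at jane@example.com"))
    print(behavioral_check([("get_balance", "acct=1")] * 3))
```

Running the deterministic layer first keeps the cheap checks ahead of any model-based scoring, and the behavioral check only needs the list of tool calls that a trace already records.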
The Breakdown
A retail bank burned $85,000 on a chatbot POC that looked great in demos and fell apart in production, until Sandipan Bhaumik rebuilt the project around five pillars and delayed model selection until week seven. His core point is blunt: enterprise agents fail less because of model choice than because teams skip evaluation, tracing, data foundations, orchestration, and governance.
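One way to picture choosing a model late is as a gate: once an eval set exists, any candidate model, or an update to the current one, is scored against it before being adopted. The sketch below is only an illustration under stated assumptions. The models are plain Python callables, the eval cases are (question, required substring) pairs, and the names pass_rate and safe_to_promote are hypothetical; none of this comes from the talk, and a real eval set would use richer scoring than substring matching.

```python
from typing import Callable

# Hypothetical eval case shape: (question, substring the answer must contain).
EvalCase = tuple[str, str]
Model = Callable[[str], str]


def pass_rate(model: Model, cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose answer contains the expected substring."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        1 for question, expected in cases
        if expected.lower() in model(question).lower()
    )
    return passed / len(cases)


def safe_to_promote(candidate: Model, baseline: Model,
                    cases: list[EvalCase], tolerance: float = 0.0) -> bool:
    """Allow the candidate only if it does not regress against the baseline."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - tolerance


if __name__ == "__main__":
    # Stub models and made-up cases, purely to show the call pattern.
    cases = [("q1", "alpha"), ("q2", "beta")]
    baseline = lambda q: {"q1": "Alpha it is", "q2": "beta then"}[q]
    candidate = lambda q: {"q1": "alpha", "q2": "unsure"}[q]
    print(pass_rate(baseline, cases), pass_rate(candidate, cases))
    print(safe_to_promote(candidate, baseline, cases))
```

Because the gate depends only on the eval set, the same check applies whether the change is a new vendor model or a silent update to the existing one.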