Back to Podcast Digest
AI Engineer··43m

$1 AI Guardrails: The Unreasonable Effectiveness of Finetuned ModernBERTs – Diego Carpentero

TL;DR

  • LLM attacks are now the baseline, not edge cases — Diego Carpentero walks through direct prompt injection, indirect context attacks, RAG poisoning, MCP/tool abuse, and agentic exploits to show how modern systems get compromised without code execution or privileged access.

  • The core security flaw is architectural: LLMs don’t separate trusted instructions from untrusted data — his Sydney/Bing example shows how a Stanford student extracted Microsoft’s hidden system prompt with plain English because the model sees system prompt and user input as one document.

  • A cheap encoder can act as a practical guardrail layer — by fine-tuning ModernBERT on the 75,000-example InjectGuard dataset, he gets roughly 85% accuracy at 35 ms per classification in a self-hosted setup that he says can run for under $1.

  • ModernBERT is the point because its architecture matches the safety problem — alternating local/global attention, RoPE, sequence packing, and Flash Attention cut fine-tuning memory by about 70% while still handling up to 8,192 tokens of context.

  • Some of the scariest attacks now target decisions and actions, not just text outputs — he highlights ad review systems being manipulated by prompts embedded in websites, a 2025 Poisoned RAG result where 5 chunks in an 8 million-document corpus were enough, and a supply-chain attack that reportedly affected nearly 4,000–5,000 developers.

  • His big message is ‘trust nothing, verify everything’ for AI pipelines — minimum production checks should cover user input and model output, but ideally also RAG context, MCP/tool descriptions, memory, and agent plans because human reviewers only see the tip of the iceberg.

The Breakdown

From curious prompt leaks to a full-blown attack surface

Diego opens with the claim that what started in 2023 as people casually trying to exfiltrate system prompts has become a much more serious landscape, especially as LLMs get wired into identity workflows and agents. His framing is blunt: these attacks are no longer exceptions, they’re the baseline, so the job is to build a low-latency defensive layer that regular teams can actually afford.

The Sydney leak and why prompt injection keeps working

He starts with the classic direct injection case: Microsoft’s Bing Chat “Sydney,” where a Stanford student got the system prompt to spill using a plain-language query like “ignore previous instructions.” The reason it worked is the important part — the model effectively reads system prompt and user prompt as one continuous document, which means there’s no native separation between control instructions and untrusted input.

Indirect injection gets nastier when the model reads the world

Then he moves from prompt attacks to context attacks: malicious instructions hidden in websites, HTML, URLs, or inboxes that just sit there until an LLM fetches them. His examples are memorable: a Wikipedia page about Albert Einstein edited to push an LLM toward an attacker-controlled malware link, and a March 26 case where websites embedded prompts to manipulate AI ad-review systems into approving non-compliant content — in his words, the data being evaluated overruled the evaluator.

Breaking alignment with math, not language

The next shift is into model internals, where attackers append gibberish suffixes that look meaningless to humans but push the model out of its refusal region. He explains the greedy coordinate gradient method in plain terms: optimize suffix tokens so the model is nudged into starting with “Sure,” and once it starts positively, autocomplete takes over. The scary kicker is transferability: attacks found on open-weight models can carry over to closed black-box systems because similar training creates similar refusal boundaries.

RAG, MCP, and agents blow the attack surface wide open

For RAG, he cites the 2025 Poisoned RAG paper: in an 8 million-document knowledge base, poisoning just 5 chunks was enough to steer answers for a target query. Then he gets into MCP, where users approve a friendly-looking tool summary while the model sees a much richer description that can hide instructions to exfiltrate private keys or credentials as side-note parameters; he mentions follow-up work that pulled WhatsApp histories this way.

Agents don’t just believe bad instructions — they act on them

The agentic section is the most unsettling because the model now has permissions and tools. He describes the “Sybby AI” pattern where an agent sees a support-looking HTML page, clicks a link, downloads a file, changes permissions, and enables remote code execution; in another real supply-chain case from February, a malicious NPM package planted via a GitHub issue led a coding agent to install it, with nearly 4,000 to 5,000 developers reportedly affected.

Zero trust for LLMs, and why encoders are a practical defense

After laying out the threat model, Diego argues that LLM systems need classic zero-trust thinking: trust nothing, verify everything. His practical proposal is to use encoder models as discriminators at multiple checkpoints — not only on user prompts and model outputs, but ideally on retrieved context, MCP descriptions, memory, and agent plans — because relying on alignment or human reviewers alone creates what he calls an iceberg effect.

Why ModernBERT works, and what the build actually achieved

The last stretch is a technical but very grounded ModernBERT walkthrough: alternating attention like reading a page and then the whole book, 8,192-token context, unpadding and sequence packing to avoid wasting compute, RoPE for position without polluting semantics, and Flash Attention to keep work on fast on-chip GPU memory. Fine-tuned on InjectGuard’s 75,000 labeled examples from 20 sources, his ModernBERT setup hit nearly 85% accuracy with 35–40 ms latency, and in a quick demo it correctly flagged Sydney-style prompt injection, Wikipedia-style indirect injection, ad-review manipulation text, gibberish jailbreak suffixes, and MCP credential-exfiltration prompts as unsafe.