AI Engineer · 24m

Skill issue: Lessons from skilling up coding agents to use Langfuse - Marc Klingen, Clickhouse

TL;DR

  • Langfuse built a skill because 478 pages of docs and stale pretraining context were breaking agent integrations — Marc Klingen says coding agents could “add Langfuse” from memory, but they often used outdated SDK methods and only fetched current docs after failing.

  • His core metaphor is a Rubik’s Cube: agents have the tool, but skills provide the manual — Klingen frames skills as the reliability layer between brittle workflows and fully autonomous agents, especially for multi-step, cross-domain tasks like password resets plus email changes.

  • Tracing got them “80% of the way” to improving the skill — By instrumenting Claude Code and reviewing execution traces, the team found concrete failure modes like hallucinated CLI flags, missing data-region prompts, and weak documentation navigation.

  • A simple search endpoint became both a better retrieval layer and a product feedback loop — Instead of forcing agents through hundreds of pages, Langfuse exposed its existing RAG-backed docs Q&A as a search endpoint, which also let the team log natural-language queries and see where users were getting stuck.

  • Even basic evals were worth shipping across wildly different app types — Langfuse created five starter evaluation setups and used LLM-as-a-judge checks on sample repos, file-system diffs, and trace outputs to make sure instrumentation changes didn’t silently regress.

  • Auto-improving the skill worked, but only as well as the target function — In an internal “auto research” experiment, the team accepted 3 of 6 suggested improvements, but Klingen warns that if you optimize for turns or speed alone, agents strip out the very doc-fetching and best-practice steps you actually need.

The Breakdown

Why Langfuse needed a skill in the first place

Marc Klingen, cofounder of Langfuse, opens with the company’s origin story: three years ago they were “building agents that didn’t work,” which pushed them into tracing and evaluation infrastructure. His framing is simple and memorable: agents are like getting a Rubik’s Cube as a kid — you can twist it all day, but without the manual, you’re just making new patterns, not solving it.

Skills as the middle ground between workflows and autonomy

He revisits the old “workflow vs. fully autonomous agents” fight and basically says both sides were right. Workflows are reliable but rigid; agents are flexible but messy. Skills, in his view, are the shortcut that lets agents handle multi-domain requests — like a customer wanting both a password reset and an email change — without forcing developers to hardcode every route in advance.

The real problem: 478 pages of docs and outdated model memory

Before building the skill, Langfuse faced a very practical mess: hundreds of documentation pages, lots of implementation flexibility, and coding agents relying on stale training data. Klingen says asking Claude Code to “add Langfuse” often looked impressive at first, but the model would implement old SDK patterns, fail verification, and only then pull fresh docs to repair the damage.

What the Langfuse skill actually changed

The goal was to give every user “a Langfuse expert” that could set up observability, prompt management, and evals with current best practices. The skill pushes agents to ask follow-up questions instead of charging ahead blindly, then progressively reveals the right docs and references. Underneath that, Langfuse’s API-heavy infrastructure turned out to be an advantage: once wrapped in a CLI, agents could do the same tasks humans used to click through manually in the UI.

The first big lesson: traces beat abstract theorizing

Klingen says reviewing traces still gets you “80% of the detail.” By instrumenting Claude Code and watching where it wandered, the team found specific issues fast: agents hallucinated CLI parameters, assumed Europe as the default data region, and skipped asking users what region they actually needed — which broke for US enterprises that also care about data locality.

Teaching agents how to navigate the docs

A lot of the work wasn’t new model magic; it was giving the agent a map. Langfuse used its llms.txt-style sitemap, told agents to request markdown instead of bloated HTML, and exposed its existing RAG-based docs assistant as a search endpoint. That last part mattered twice over: agents got relevant chunks in one shot, and Langfuse got telemetry on the exact natural-language questions people were asking.
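To make the "map" idea concrete, here is a minimal sketch of how an agent-side helper might read an llms.txt-style index and pick only the markdown pages relevant to a task. The sample index, the docs.example.com URLs, and the function names are all hypothetical and are not taken from Langfuse's actual docs or skill.

```python
import re

# Matches markdown list links such as
# "- [Title](https://example.com/page.md): optional note"
LINK_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_index(text):
    """Return (section, title, url) tuples from an llms.txt-style index."""
    section = None
    entries = []
    for line in text.splitlines():
        if line.startswith("## "):
            # A level-2 heading starts a new section of links.
            section = line[3:].strip()
            continue
        match = LINK_RE.match(line)
        if match:
            entries.append((section, match.group("title"), match.group("url")))
    return entries


def pick_pages(entries, keywords):
    """Keep entries whose title mentions any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [e for e in entries if any(k in e[1].lower() for k in wanted)]


# Hypothetical index content, for illustration only.
SAMPLE_INDEX = """\
# Example Docs

## Tracing
- [Get started with tracing](https://docs.example.com/tracing/get-started.md): setup
- [Python SDK](https://docs.example.com/sdk/python.md)

## Prompts
- [Prompt management](https://docs.example.com/prompts/overview.md): versioning
"""

entries = parse_index(SAMPLE_INDEX)
prompt_pages = pick_pages(entries, ["prompt"])
```

Filtering the index first means the agent fetches a handful of markdown pages rather than crawling hundreds of HTML ones, which is the behavior the talk describes.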

Basic evals were enough to make the system safer to change

The team initially got stuck because Langfuse users span everything from chat apps to invoice processing to real-time voice and video. Instead of solving evals perfectly, they shipped five basic setups and tested changes with LLM-as-a-judge checks over sample repos, file diffs, and expected trace outputs — like confirming OpenAI instrumentation was added or RAG retrieval spans appeared.
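The trace-output side of such a check can be sketched as a simple deterministic assertion, which could run alongside an LLM-as-a-judge step rather than replace it. The trace shape used here (a list of dicts with "type" and "name" keys) and the expected names are assumptions for illustration, not Langfuse's real export format or the team's actual eval code.

```python
def missing_observations(trace, expected):
    """Return the expected (type, name_fragment) pairs not found in the trace.

    `trace` is assumed to be a list of dicts with "type" and "name" keys.
    """
    missing = []
    for obs_type, fragment in expected:
        found = any(
            obs.get("type") == obs_type
            and fragment.lower() in obs.get("name", "").lower()
            for obs in trace
        )
        if not found:
            missing.append((obs_type, fragment))
    return missing


# Hypothetical expectations for a RAG sample repo: an OpenAI generation
# and a retrieval span should both show up after instrumentation.
EXPECTED = [("generation", "openai"), ("span", "retrieval")]

good_trace = [
    {"type": "span", "name": "rag-retrieval"},
    {"type": "generation", "name": "openai.chat.completions"},
]

# Simulates a skill change that silently dropped retrieval instrumentation.
regressed_trace = [
    {"type": "generation", "name": "openai.chat.completions"},
]

good_result = missing_observations(good_trace, EXPECTED)
regressed_result = missing_observations(regressed_trace, EXPECTED)
```

An empty result means the sample run produced every expected observation; a non-empty one names exactly what regressed.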

Auto-research worked, but the objective function quietly controlled everything

One of the most interesting experiments was using agents to improve the skill itself, especially for migrating prompts from Git repos into Langfuse prompt management. They accepted 3 of 6 suggestions, but the deeper lesson was that the target function decides what survives: if you optimize for fewer turns, the system strips out documentation checks; if you don’t include trace-linked prompt versioning, the agent treats that as disposable “garbage on the way.” He closes on a very current product question: should a skill aim for the quickest “aha” moment, or try to deliver the perfect setup in one shot, even if that means asking a lot more questions?
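The point about the target function can be illustrated with a toy scoring sketch. The candidate runs, field names, and penalty value below are invented for illustration and do not reflect the team's actual experiment; the sketch only shows how the choice of objective changes which variant of a skill wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical run of a skill variant."""
    name: str
    turns: int
    fetched_docs: bool
    linked_prompt_version: bool


def turns_only(candidate):
    """Objective that rewards fewer turns and nothing else."""
    return -candidate.turns


def turns_with_requirements(candidate, penalty=100):
    """Objective that still prefers fewer turns but penalizes skipped steps."""
    score = -candidate.turns
    if not candidate.fetched_docs:
        score -= penalty
    if not candidate.linked_prompt_version:
        score -= penalty
    return score


def best(candidates, objective):
    """Return the candidate with the highest score under the objective."""
    return max(candidates, key=objective)


CANDIDATES = [
    Candidate("thorough", turns=12, fetched_docs=True, linked_prompt_version=True),
    Candidate("stripped", turns=5, fetched_docs=False, linked_prompt_version=False),
]

winner_fast = best(CANDIDATES, turns_only)
winner_safe = best(CANDIDATES, turns_with_requirements)
```

Under the turns-only objective the stripped variant wins; once the required steps are part of the score, the thorough variant does. Anything the objective does not mention is free for the optimizer to delete.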
