
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
Ben Burtenshaw’s core claim is that coding agents are ready for “hard mode” AI engineering — not just app code, but CUDA kernels, LLM fine-tuning, and even multi-agent research workflows that touch real compute, benchmarking, and training jobs.
Custom CUDA kernels are no longer off-limits for agents — Ben points to GPU Mode hackathons, AMD events, and KernelBench as evidence, then shows Hugging Face Kernels as the missing distribution layer so generated kernels can actually be packaged, benchmarked, and used.
The bottleneck in deep learning is often memory, not math. His H100 example makes it concrete: roughly a petaflop per second of compute against about 3 TB/s of memory bandwidth. That gap is why techniques like FlashAttention matter, since they increase arithmetic intensity and “keep the GPUs warm.”
Skills are the practical trick that turns zero-shot engineering into few-shot engineering — Hugging Face bakes file-based, versioned skills into projects so agents can open benchmark scripts, examples, and usage patterns on demand, and Ben says this helped generate a Qwen 3 8B H100 kernel with a 94% speedup.
Hugging Face is positioning the Hub as infrastructure for agentic systems engineering — kernels, HF CLI skills, Jobs, Papers, Trackio, storage, and compute are all presented as open primitives that agents can orchestrate rather than black-box APIs they can’t see behind.
The most ambitious demo is an “automated AI lab” split into researcher, planner, worker, and reporter agents — inspired by Andrej Karpathy’s auto-research work, Ben’s AutoLab fans out literature search, hypothesis generation, code changes, training runs, and dashboard reporting into a parallel loop that can run for hours.
Ben opens by saying the argument is basically over: coding agents have been “accepted,” and the real question now is how engineers stay contemporary. His answer is to move closer to the silicon and use agents on tougher problems like AI systems engineering and ML engineering, framing the talk as three escalating “video game bosses.”
He starts with the once-heretical idea that agents can write optimized CUDA kernels, something many people considered too hardware-specific and too messy to benchmark. Ben says that assumption has largely broken, citing GPU Mode hackathons, the AMD hackathon, and KernelBench as proof that agents can produce valid, optimized kernels.
Ben pauses to explain the mechanics: kernels are the units of actual GPU work when running AI models, and optimizing them is about compute, memory, and overhead. The memorable punchline is that most people guess compute is the bottleneck, but on a modern GPU like the H100, memory often wins. The chip can do about a petaflop of compute per second but only move data at about 3 TB/s, so the goal is to “keep the GPUs warm” by doing more math per byte read, as FlashAttention does.
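The compute-versus-memory gap can be made concrete with a roofline-style estimate. This is a minimal sketch using only the round numbers quoted above; the function names and the elementwise-add example are illustrative assumptions, not something shown in the talk.

```python
# Roofline-style back-of-envelope estimate using the round numbers quoted above.
PEAK_FLOPS = 1e15        # ~1 PFLOP/s of compute (approximate H100 figure)
PEAK_BANDWIDTH = 3e12    # ~3 TB/s of memory bandwidth (approximate)


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """FLOPs per byte a kernel needs before compute, not memory, is the limit."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     peak_bandwidth: float = PEAK_BANDWIDTH) -> float:
    """Roofline model: throughput is capped by whichever resource runs out first."""
    return min(peak_flops, intensity * peak_bandwidth)


def elementwise_add_intensity(bytes_per_element: int = 2) -> float:
    """c = a + b does 1 FLOP per element while reading two values and writing one."""
    return 1 / (3 * bytes_per_element)


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
    add_ai = elementwise_add_intensity()
    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    print(f"16-bit add intensity: {add_ai:.3f} FLOPs/byte")
    print(f"fraction of peak: {attainable_flops(add_ai) / PEAK_FLOPS:.2%}")
```

With these round numbers a kernel needs roughly 333 FLOPs per byte moved before compute becomes the limit, while a simple 16-bit elementwise add sits near 0.17 FLOPs per byte and would reach about 0.05% of peak. Fusing operations so that each byte is reused for more math is what moves a kernel toward the compute roof.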
The practical problem isn’t just generating kernels — it’s distributing them, describing compatibility, and plugging them into inference. Ben presents Hugging Face Kernels as a Hub-native repo format with metadata about hardware and CUDA versions, then explains “skills” as file-based context that lets agents pull examples, benchmarking scripts, and usage docs when needed, turning a zero-shot task into a few-shot one.
He shows this isn’t just theory by describing a benchmark where a generated kernel for Qwen 3 8B on H100 got a 94% speedup, though he’s careful to say this isn’t some state-of-the-art universal result. His point is more tactical: there’s low-hanging fruit in hardware/model mismatches, and the open-source Upskill tool helps compare which models use a skill best, calling out examples like GPT-OSS, Kimi, and Haiku on accuracy versus token cost.
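The summary does not say how the 94% figure was computed, and a percentage speedup can be read in more than one way. The sketch below shows two common readings side by side; the timings are hypothetical placeholders chosen for illustration, not measurements from the talk.

```python
def speedup_factor(baseline_s: float, optimized_s: float) -> float:
    """How many times faster the optimized kernel runs."""
    return baseline_s / optimized_s


def percent_faster(baseline_s: float, optimized_s: float) -> float:
    """Throughput gain: a 94% speedup in this reading is a 1.94x factor."""
    return (speedup_factor(baseline_s, optimized_s) - 1) * 100


def percent_time_saved(baseline_s: float, optimized_s: float) -> float:
    """Share of the original runtime that was removed."""
    return (1 - optimized_s / baseline_s) * 100


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, not taken from the benchmark.
    baseline_ms, optimized_ms = 1.94, 1.00
    print(f"factor:     {speedup_factor(baseline_ms, optimized_ms):.2f}x")
    print(f"faster by:  {percent_faster(baseline_ms, optimized_ms):.0f}%")
    print(f"time saved: {percent_time_saved(baseline_ms, optimized_ms):.1f}%")
```

Under the throughput reading, 94% faster means the kernel takes a little over half the original time. Under the time-saved reading, the same number would mean a kernel more than sixteen times faster, so it is worth checking which definition a benchmark uses.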
The second boss is much simpler: tell an agent to fine-tune a model like Qwen 3 0.6B on a chain-of-thought dataset, and let the Hugging Face stack handle the rest. Ben moves quickly here, pointing people to his colleague Merve’s deeper talk and to blog posts showing Claude and Unsloth-based workflows, emphasizing that this is already integrated with Hub compute and often comes with free credits to try.
The big finale is AutoLab, inspired by Andrej Karpathy’s recent auto-research project that had Claude iteratively improve nanoGPT training runs. Ben’s twist is to split the work across specialized agents: a researcher scouts papers via HF Papers, a planner turns ideas into a queue, workers implement training-script changes and launch HF Jobs, and a reporter monitors everything in Trackio.
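The division of labor described above can be sketched as a small pipeline. This is a toy illustration under loud assumptions: every function, the idea strings, and the loss formula are hypothetical stand-ins, and nothing here calls HF Papers, HF Jobs, or Trackio or reflects AutoLab's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    idea: str
    learning_rate: float


def researcher() -> list[str]:
    """Stand-in for literature search; returns hypothetical idea strings."""
    return ["lower learning rate", "higher learning rate", "lower learning rate"]


def planner(ideas: list[str]) -> list[Experiment]:
    """Turn ideas into a queue, dropping duplicates along the way."""
    settings = {"lower learning rate": 1e-4, "higher learning rate": 1e-3}
    seen: set[str] = set()
    queue: list[Experiment] = []
    for idea in ideas:
        if idea in seen:
            continue  # duplicate idea, already queued
        seen.add(idea)
        queue.append(Experiment(idea, settings[idea]))
    return queue


def worker(experiment: Experiment) -> dict:
    """Stand-in for a training run; the loss is a toy formula, not real training."""
    loss = abs(experiment.learning_rate - 3e-4) * 1000
    return {"idea": experiment.idea, "loss": round(loss, 3)}


def reporter(results: list[dict]) -> dict:
    """Summarize the runs by picking the lowest toy loss."""
    return min(results, key=lambda result: result["loss"])


def run_lab(max_workers: int = 2) -> dict:
    queue = planner(researcher())
    # Workers fan out in parallel; map returns results in queue order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, queue))
    return reporter(results)


if __name__ == "__main__":
    print(run_lab())
```

Keeping each role as a separate function with plain data passed between them mirrors the split in the talk: the planner owns the queue, workers run independently, and the reporter only reads results.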
Ben walks through how this looks in practice inside OpenCode: agents use templates, branch off experiments, review stale or duplicate ideas, and run for hours while Trackio collects metrics, events, and warnings. His closing takeaway is blunt: agents work best with open primitives like Trackio’s parquet-backed data layer and Hub-native storage/compute, because opaque abstractions create ceilings, while well-exposed systems let agents actually engineer.