AI Engineer · 18m

The Small Model Infrastructure Nobody Built (So We Did) — Filip Makraduli, Superlinked

TL;DR

  • Superlinked built an open-source small-model inference stack because the market had a real gap between model serving and production infra — Filip Makraduli says existing tools helped you run models, but not handle routing, autoscaling, queuing, GPU provisioning, Prometheus/Grafana monitoring, and deployment end to end.

  • Small models matter because they fight 'context rot' in agent systems before the big model ever sees the data — citing Chroma’s research, he argues that preprocessing, filtering, taxonomy classification, named entity recognition, and tool-calling with small models can shrink context and improve downstream agent performance.

  • Throwing more GPUs at inference is the wrong mental model for small models: when models like Stella embeddings, rerankers, and GLiNER each take only a few GB, dedicating one GPU per model wastes memory, so Superlinked focused on hot-swapping models on a single GPU with least-recently-used eviction.

  • Supporting 'hundreds of models' is much harder than it sounds because BERT, Qwen, ColBERT, rerankers, and cross-encoders all behave differently under the hood — Makraduli walks through mismatches in FlashAttention, normalization, fused QKV, grouped-query attention, and positional embeddings like absolute lookup vs rotary.

  • Their core pitch is the 'yin and yang' of inference: model support plus infrastructure — he says inference is only useful if you both support fast-moving open-source models from Hugging Face and provide the cluster layer to run them, with API primitives like encode, score, and extract backed by KEDA autoscaling and spot-instance orchestration.

  • This talk doubles as a soft launch for SIE, the Superlinked Inference Engine — the repo has already been tested with partners including Chroma, Qdrant, Weaviate, and LanceDB, and Makraduli frames it as the practical way he closed his own blind spot around production inference.

The Breakdown

From FlashAttention confidence to an inference blind spot

Filip Makraduli opens with a little humility: he’d written a Substack post explaining FlashAttention, memory-bound vs compute-bound workloads, and felt pretty good about it — until people pointed out he’d missed the thing that makes models fast in the real world: inference. Instead of hand-waving past it, he treated it like a personal bug report and decided to learn by building.

Joining Superlinked to build the missing piece

That learning path led him to Superlinked, where he teamed up with infrastructure engineers to build an open-source inference engine for small models used in AI search and document processing. He presents the repo as a soft launch and notes it’s already been tested with Chroma, Qdrant, Weaviate, and LanceDB — not a toy, but something partners have actually kicked the tires on.

Why small models matter in agent workflows

His case for this category starts with “context rot,” pointing to Chroma’s research showing that output quality degrades as the input context grows longer. The answer, he says, is to use small models upstream for preprocessing, filtering, named entity recognition, taxonomy classification, and tool-calling so agents get cleaner inputs and less token bloat; he mentions Andrej Karpathy’s graph-based knowledge base work as part of the same broader response.

The myth that inference is just 'add more GPUs'

Makraduli argues that the usual scaling instinct breaks down for small models because many only occupy a few gigabytes, so pinning one GPU to one model leaves expensive hardware sitting idle. His team built hot-swapping so multiple models can share a GPU, plus a least-recently-used eviction policy, which cuts waste and makes it easier to switch between tools like rerankers and retrievers on demand.
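The hot-swapping idea can be illustrated with a minimal sketch. This is not Superlinked's implementation: the class name `LRUModelCache`, the `loader` callback, the model names, and the per-model sizes are all hypothetical, and a real engine would also have to move weights on and off the GPU rather than just track names.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class LRUModelCache:
    """Keeps models resident within a memory budget, evicting the least recently used."""

    def __init__(self, budget_gb: float, loader: Callable[[str], object],
                 sizes_gb: Dict[str, float]) -> None:
        self.budget_gb = budget_gb          # assumed usable GPU memory
        self.loader = loader                # hypothetical function that loads a model by name
        self.sizes_gb = sizes_gb            # assumed memory footprint per model
        self.resident: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: List[str] = []        # eviction log, oldest first

    def used_gb(self) -> float:
        return sum(self.sizes_gb[name] for name in self.resident)

    def get(self, name: str) -> object:
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name]
        size = self.sizes_gb[name]
        if size > self.budget_gb:
            raise ValueError(f"{name} does not fit in {self.budget_gb} GB")
        # Evict from the least recently used end until the new model fits.
        while self.resident and self.used_gb() + size > self.budget_gb:
            victim, _ = self.resident.popitem(last=False)
            self.evicted.append(victim)
        self.resident[name] = self.loader(name)
        return self.resident[name]


# Three hypothetical 3 GB models sharing an 8 GB budget.
cache = LRUModelCache(
    budget_gb=8.0,
    loader=lambda name: f"<{name} weights>",
    sizes_gb={"embedder": 3.0, "reranker": 3.0, "ner": 3.0},
)
for request in ["embedder", "reranker", "embedder", "ner"]:
    cache.get(request)

print(list(cache.resident))  # ['embedder', 'ner']
print(cache.evicted)         # ['reranker']
```

The reranker is evicted rather than the embedder because the embedder was touched more recently, which is the behavior a least-recently-used policy is meant to give.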

Serving a model isn't the same as productionizing it

Another misconception: inference is not just spinning up a server with vLLM, TGI, or an API wrapper. The hard part is everything around it — routing, autoscaling, queuing, monitoring, and provisioning GPUs — and he says there wasn’t an open-source stack that took teams all the way from model runtime to production deployment for this small-model use case.

The 'yin': supporting the messy reality of open-source models

His first half of the “yin and yang of inference” is model support, and he makes the point plainly: inference is worthless if you don’t support the models people actually want to use. With millions of models on Hugging Face, strong open-source results on narrow benchmarks like MTEB, and examples like Gemma models reaching high Elo scores at low parameter counts, he says open source is no longer a compromise.

Why BERT, Qwen, and ColBERT break the dream of a universal engine

This is the most technical stretch of the talk: different models disagree on normalization, FlashAttention implementation, fused query-key-value projections, positional encoding, and output format. BERT, Qwen, ColBERT, cross-encoders, and rerankers all need different handling, so Superlinked re-implemented forward passes, added variable-length FlashAttention to avoid wasting compute on padding, and built support for oddballs like multi-vector late-interaction models.
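To make the padding point concrete, here is a rough sketch of the packing step that variable-length attention kernels typically rely on. The function names and token IDs are made up for illustration, and the talk does not describe Superlinked's code at this level; the cumulative-offset layout (`cu_seqlens`) is the convention used by variable-length FlashAttention interfaces.

```python
from itertools import accumulate
from typing import List, Tuple


def pack_sequences(batch: List[List[int]]) -> Tuple[List[int], List[int], int]:
    """Concatenate sequences without padding.

    Returns (tokens, cu_seqlens, max_seqlen), where cu_seqlens[i] is the
    offset at which sequence i starts in the flat token list.
    """
    lengths = [len(seq) for seq in batch]
    tokens = [tok for seq in batch for tok in seq]
    cu_seqlens = [0] + list(accumulate(lengths))
    return tokens, cu_seqlens, max(lengths, default=0)


def padded_token_count(batch: List[List[int]]) -> int:
    """Tokens processed if every sequence is padded to the longest one."""
    return len(batch) * max((len(seq) for seq in batch), default=0)


# Hypothetical token IDs for three inputs of length 4, 3, and 8.
batch = [
    [101, 7, 8, 102],
    [101, 9, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
]
tokens, cu_seqlens, max_seqlen = pack_sequences(batch)

print(cu_seqlens)                              # [0, 4, 7, 15]
print(len(tokens), padded_token_count(batch))  # 15 24
print(tokens[cu_seqlens[1]:cu_seqlens[2]])     # [101, 9, 102]
```

In this toy batch the packed layout holds 15 tokens where a padded batch would hold 24, so more than a third of the padded compute would have gone to padding.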

The 'yang': cluster plumbing, autoscaling, and the SIE launch

The infrastructure side wraps those models in three API primitives (encode, score, and extract), then layers in routers, queues, GPU pools, spot instances, larger GPUs, and KEDA autoscaling driven by Prometheus metrics. His pitch is simple: users shouldn’t have to stitch model support and cluster ops together themselves; with SIE, models become config, deployment becomes a terraform apply, and the whole thing ships with Helm charts and Docker images. He closes by revealing the opening slide’s mystery background: a sinusoidal positional encoding visualization, which nicely ties the joke back to the technical core.
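As a rough illustration of metric-driven autoscaling, the sketch below computes a replica count from a single Prometheus-style value. The choice of queue depth as the metric, the per-replica target, and the replica bounds are assumptions, not details from the talk. The arithmetic follows the usual average-value rule (total metric divided by per-replica target, rounded up) and leaves out the tolerance and stabilization behavior of a real autoscaler.

```python
import math


def desired_replicas(metric_value: float, target_per_replica: float,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Replica count needed so each replica handles at most the target load."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(metric_value / target_per_replica)
    # Clamp to the configured bounds; min_replicas=0 allows scale to zero.
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical queue depths with a target of 10 queued requests per replica.
print(desired_replicas(45, 10))                   # 5
print(desired_replicas(0, 10))                    # 0
print(desired_replicas(500, 10, max_replicas=8))  # 8
```

Scaling on queue depth rather than CPU is a common choice for GPU inference, because a saturated GPU worker can show low CPU usage while requests pile up.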
