AI Engineer · June 28, 2026 · 5m

When All Context Matters: Extended Cache Augmented Generation - Luis Romero-Sevilla, Orbis

TL;DR

Simple RAG fails when all documents are relevant: vector databases retrieve only documents within a similarity threshold, but some scenarios require synthesizing answers across an entire collection.
GraphRAG is too slow for rapidly changing data: recomputing the knowledge graph every time documents are replaced is computationally expensive and time-consuming.
Cache Augmented Generation (CAG) hits context limits: loading all documents into a model's context window degrades answer quality when the window fills up.
Extended CAG distributes documents across parallel buckets: each bucket caches its own KV matrix, and a supervisor model interrogates the right buckets to synthesize answers.
Random distribution beats domain categorization: when documents have dense interconnections, organizing by domain causes supervisors to ignore seemingly irrelevant categories that actually matter.
No retrieval strategy fits all problems: each approach has trade-offs in compute, cost, and speed, so the right solution depends on the specific problem constraints.
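
To make the bucket idea concrete, here is a minimal sketch of random distribution, not the speaker's implementation. The names `buckets_needed` and `distribute_randomly`, the per-bucket token budget, and the whitespace word count used as a stand-in for a real tokenizer are all assumptions made for illustration.

```python
import math
import random


def buckets_needed(documents, max_tokens_per_bucket, count_tokens=None):
    """Estimate how many buckets are needed to stay under a per-bucket budget.

    count_tokens is a hypothetical hook; the default counts whitespace-separated
    words, which only approximates a real tokenizer.
    """
    if max_tokens_per_bucket < 1:
        raise ValueError("max_tokens_per_bucket must be positive")
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    total = sum(count_tokens(doc) for doc in documents)
    return max(1, math.ceil(total / max_tokens_per_bucket))


def distribute_randomly(documents, num_buckets, seed=0):
    """Shuffle documents, then deal them round-robin into num_buckets lists.

    No domain labels are consulted, so related documents end up spread
    across buckets instead of being grouped into one category.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be positive")
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    buckets = [[] for _ in range(num_buckets)]
    for index, doc in enumerate(shuffled):
        buckets[index % num_buckets].append(doc)
    return buckets
```

Dealing round-robin after a shuffle keeps bucket sizes within one document of each other. It balances document counts rather than token counts, so a production version would likely need to account for document length as well.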

The Breakdown

When every document in a collection matters for answering a question and the data turns over rapidly, traditional RAG and GraphRAG both fail. Luis Romero-Sevilla introduces Extended Cache Augmented Generation, a parallel approach that distributes documents across multiple cached context buckets and uses a supervisor model to interrogate them.
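
The supervisor pattern described above can be sketched as a parallel fan-out followed by a synthesis step. This is an illustrative outline under stated assumptions, not the method presented in the talk: `Bucket`, `supervise`, `toy_bucket_model`, and `toy_synthesize` are hypothetical names, the toy functions stand in for real model calls, and no KV cache is actually built. For simplicity the sketch queries every bucket rather than selecting a subset.

```python
from concurrent.futures import ThreadPoolExecutor


class Bucket:
    """One slice of the collection plus a callable that answers from it.

    In a real system answer_fn would call a model whose KV cache was
    precomputed over this bucket's documents; here it is any callable
    taking (documents, question).
    """

    def __init__(self, documents, answer_fn):
        self.documents = list(documents)
        self.answer_fn = answer_fn

    def ask(self, question):
        return self.answer_fn(self.documents, question)


def supervise(question, buckets, synthesize):
    """Ask every bucket in parallel, then merge the partial answers."""
    if not buckets:
        return synthesize(question, [])
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        # pool.map returns results in the same order as the input buckets.
        partials = list(pool.map(lambda bucket: bucket.ask(question), buckets))
    return synthesize(question, partials)


def toy_bucket_model(documents, question):
    """Stand-in for a bucket model: return documents sharing a word with the question."""
    terms = set(question.lower().split())
    return [doc for doc in documents if terms & set(doc.lower().split())]


def toy_synthesize(question, partials):
    """Stand-in for the supervisor model: merge and de-duplicate partial answers."""
    return sorted({doc for partial in partials for doc in partial})


if __name__ == "__main__":
    buckets = [
        Bucket(["pump A failed", "valve ok"], toy_bucket_model),
        Bucket(["pump B failed"], toy_bucket_model),
    ]
    print(supervise("failed", buckets, toy_synthesize))
```

Threads suit this shape because each bucket query would spend most of its time waiting on a model call, and the bucket calls are independent of one another until the synthesis step.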