AskwhoCasts AI | June 10, 2026 | 41m

How LLMs Actually Work - By 0xkato

TL;DR

LLMs never read words directly: They read token IDs from a fixed vocabulary, which is why classic failures like counting the R's in "strawberry" came from tokenization, not a simple inability to count letters.
Embeddings turn IDs into meaning, but position has to be added separately: A token ID like 1024 is only a row index until the embedding matrix maps it to a vector, and modern models usually encode position with RoPE instead of older sinusoidal or learned absolute position embeddings.
Attention is just learned matching between tokens: Query, key, and value vectors let a token like "was" look back at "cat" instead of "yesterday," and causal masking ensures each token in a GPT-style model attends only to itself and the tokens before it.
Multi-head attention gives the model many parallel views of the same sequence: Heads are learned projections, not literal vector slices, and practical systems now often use grouped-query attention, as in Llama 2 70B, which has 64 query heads but only 8 key-value heads.
A lot of the model's stored knowledge seems to live in the feed-forward network: The FFN holds most of the parameters in dense transformers, shows concept-specific activations, and can even be edited with methods like ROME to change stored facts such as "Eiffel Tower is in Paris."
Most frontier LLMs share one family skeleton: GPT, Claude, Gemini, Llama, Mistral, Gemma, and Qwen mostly differ in weights, data, scale, and post-training, while the common 2023 to 2025 stack converged on pre-norm, RMSNorm, RoPE, SwiGLU, GQA, and sometimes mixture-of-experts.
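
The first two points above (text becomes token IDs, and IDs become vectors through a row lookup) can be sketched in a few lines. This is a minimal illustration, not how any real tokenizer works: the vocabulary, the embedding numbers, and the greedy longest-match split are all made-up assumptions standing in for a learned BPE-style vocabulary and a trained embedding matrix.

```python
# Hypothetical toy vocabulary: real vocabularies are learned and far larger.
VOCAB = {"str": 0, "aw": 1, "berry": 2, "cat": 3, "s": 4}

# Hypothetical embedding matrix: one row of made-up numbers per token ID.
EMBEDDINGS = [
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.5, 0.9],
    [0.4, 0.4],
    [0.0, -0.6],
]

def tokenize(text, vocab):
    """Greedy longest-match split into token IDs (a simple stand-in for BPE)."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

def embed(ids, table):
    """Embedding lookup: each ID selects one row of the table."""
    return [table[i] for i in ids]

ids = tokenize("strawberry", VOCAB)
vectors = embed(ids, EMBEDDINGS)
print(ids)      # [0, 1, 2]
print(vectors)  # [[0.1, 0.3], [0.7, -0.2], [-0.5, 0.9]]
```

In this toy setup the model receives three IDs for "strawberry", and none of them exposes the individual letters, which is the intuition behind the letter-counting failure mentioned above.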

The Breakdown

A modern LLM is mostly the same transformer stack repeated over and over: tokenize text into IDs, turn them into vectors, mix information with attention and feed-forward layers, then predict one next token at a time. The big differences between GPT, Claude, Gemini, Llama, and others are less about a mysterious new architecture and more about trained weights, scale, configuration choices like RoPE or GQA, and post-training.
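
The "mix information with attention" step can be sketched as single-head scaled dot-product attention with a causal mask. This is a rough sketch under simplifying assumptions: the query, key, and value vectors below are made-up numbers passed in directly, whereas a real model produces them with learned projections, runs many heads in parallel, and uses optimized tensor libraries instead of Python lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v hold one vector per token. Returns (outputs, weights), where
    weights[i][j] is how much token i attends to token j.
    """
    n, d = len(q), len(q[0])
    outputs, weights = [], []
    for i in range(n):
        # Causal mask: token i only scores positions 0..i, never later ones.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        outputs.append(out)
        # Record zero weight for the masked (future) positions.
        weights.append(w + [0.0] * (n - i - 1))
    return outputs, weights

# Made-up query/key/value vectors for a three-token sequence.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_attention(Q, K, V)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only see itself
```

Each row of `weights` sums to 1, and every entry above the diagonal is zero, which is what restricting a token to itself and earlier tokens looks like in practice.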