AI Engineer · 1h 21m

Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs

TL;DR

  • You can train a tiny GPT-2-style model from scratch on a laptop — Angelos Perivolaropoulos says a 1.8M-parameter model with 6 layers, 6 heads, 384-dim embeddings, and 256-token context can run on a 16 GB machine or a free Google Colab T4 GPU in about 15 minutes.

  • The tokenizer choice is half the game when data is scarce — he uses a 65-token character-level tokenizer on Shakespeare because a full 50,000-token GPT-2 vocab would explode both data needs and parameter count, with embeddings alone jumping to roughly 19M parameters.

  • Most of the architecture is boring on purpose — and that’s the point — the workshop sticks to plain PyTorch and a decoder-only transformer with attention, MLPs, residuals, and layer norms to show that “80% of the way” to real model training is surprisingly compact and understandable.

  • Training quality comes down less to exotic pretraining and more to the loop around it — Angelos argues the big gains between models like Gemini 3 and 3.1 often come from smarter fine-tuning and post-training rather than radically different base architectures.

  • Loss curves tell a very human story of the model learning English one baby step at a time — around loss 4.17 the model knows nothing, near 3.3 it picks up character frequencies, near 2.5 it starts finding patterns like “th,” and around 1.0–1.2 it begins producing recognizable Shakespeare before overfitting below 1.0.

  • Reasoning and multimodality are extensions, not magic replacements — he says reasoning models often share the same base model and get their behavior from high-quality post-training data, while video and audio systems typically plug encoder-produced embeddings into the same transformer stack.

The Breakdown

A research engineer opens with the pitch: no pretrained crutches

Angelos Perivolaropoulos, who leads speech-to-text at ElevenLabs and worked on Scribe v2, frames the session as a real hands-on build: train an LLM from scratch with no pretrained weights, just PyTorch and a few basic libraries. He jokes he could go lower-level than Torch “but I don’t want to torch you that much,” and says this gets you roughly 80% of the way to how research engineers actually build models.

Why this workshop starts with Shakespeare and a tiny tokenizer

The project is inspired by Andrej Karpathy’s NanoGPT, but tuned for something attendees can realistically run locally. Instead of a modern BPE tokenizer, he uses character-level tokenization on Shakespeare, giving just 65 tokens; that keeps the number of possible bigrams to 4,225, which is small enough for a tiny dataset to actually cover.
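
What that looks like in code: a minimal sketch of a character-level tokenizer in plain Python. The "input.txt" filename is an assumption standing in for the Shakespeare corpus he loads.

```python
# Build a character-level vocabulary from the raw Shakespeare text.
# "input.txt" is an assumed local path for the corpus.
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                      # ~65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer ID
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(len(chars))             # vocab size, ~65
print(encode("Sky is blue"))  # every character becomes its own token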

The tradeoff: character tokenization is easy to train and bad at scaling

He explains the compromise clearly: character-level tokenization makes a toy model tractable, but it’s much worse at representing meaningful chunks like words and phrases. “Sky is blue” is easy to model as words; splitting it into characters makes attention work harder and convergence worse, which is why serious models use BPE-style tokenizers built from common patterns in huge corpora.
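
To make the contrast concrete, here is a quick comparison using the tiktoken library's GPT-2 encoding as a stand-in for the BPE-style tokenizers he mentions (tiktoken itself is not named in the talk):

```python
import tiktoken

bpe = tiktoken.get_encoding("gpt2")   # GPT-2's ~50,000-token BPE vocabulary
print(bpe.encode("Sky is blue"))      # a handful of word-ish tokens
print(list("Sky is blue"))            # 11 separate characters for attention to stitch together
```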

The transformer is simpler than people think

Angelos walks through the four building blocks: self-attention, MLP/feed-forward layers, residual connections, and layer norms. His tone is intentionally demystifying — you don’t need deep transformer theory to start training one, and a lot of modern progress is “more optimizations than necessarily reinventing the wheel.”
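
A sketch of those four pieces in PyTorch, using nn.MultiheadAttention for brevity rather than hand-rolled attention; the pre-layer-norm ordering and the 4x MLP expansion are assumptions that follow the GPT-2 convention, not his exact code:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One decoder block: causal self-attention and an MLP, each wrapped in
    a residual connection with a layer norm in front."""
    def __init__(self, n_embd: int = 384, n_head: int = 6):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True marks positions a token is NOT allowed to attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x
```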

From config to code: a GPT in surprisingly few lines

The concrete model is classic decoder-only GPT-2 style: vocab size 65, block size 256, 6 layers, 6 attention heads, and 384-dimensional embeddings. He emphasizes how small the actual implementation is — a top-level GPT module, transformer blocks, an LM head, and a forward pass that converts token IDs into embeddings, adds positional information, and produces logits for the next token.
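
A sketch of that top-level module with the stated config, reusing the Block from the previous sketch; the learned positional embedding and the final layer norm are assumptions in line with GPT-2 rather than details he spells out:

```python
class GPT(nn.Module):
    """Decoder-only GPT sketch: vocab 65, block size 256, 6 layers, 6 heads, 384-dim."""
    def __init__(self, vocab_size=65, block_size=256, n_layer=6, n_head=6, n_embd=384):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # token IDs -> vectors
        self.pos_emb = nn.Embedding(block_size, n_embd)   # learned positions
        self.blocks = nn.ModuleList([Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)   # embeddings + positional info
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))           # logits for the next token

model = GPT()   # small enough to train on a laptop or a free Colab T4
```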

Training is where the real craft shows up

On the training side, he loads about 1 million characters of Shakespeare, slices 256-token sequences, and trains with batch size 64 using AdamW, warmup for 100 steps, and cosine decay out to 5,000 steps. He keeps coming back to the same point: training loops, validation loss, and data handling matter more than people think, and that’s often where model performance is really made.
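
Roughly what that loop looks like. Batch size, sequence length, the 100-step warmup, and the 5,000-step cosine schedule follow his numbers; the learning rates (1e-3 down to 1e-4) and the 90/10 train/validation split are assumptions. It builds on model, encode, and text from the earlier sketches.

```python
import math
import torch
import torch.nn.functional as F

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))                       # assumed 90/10 split
train_data, val_data = data[:n], data[n:]

def get_batch(split, batch_size=64, block_size=256):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size - 1, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([d[i + 1:i + 1 + block_size] for i in ix])  # targets, shifted by one
    return x, y

max_steps, warmup, max_lr, min_lr = 5_000, 100, 1e-3, 1e-4     # learning rates are assumed
opt = torch.optim.AdamW(model.parameters(), lr=max_lr)

def lr_at(step):
    if step < warmup:                                   # linear warmup
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)          # cosine decay to min_lr
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

for step in range(max_steps):
    for g in opt.param_groups:
        g["lr"] = lr_at(step)
    xb, yb = get_batch("train")
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
```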

Watching the model learn: from gibberish to fake Shakespeare

He gives a memorable progression of loss values so people know what “learning” looks like: random initialization starts near ln(65) ≈ 4.17, then the model picks up character frequencies, then fragments like “th,” then words, then recognizable names and phrases. In his run, the sweet spot was around 2,400 steps; after that, validation loss stopped improving and the samples got less creative even as training loss kept falling.
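
That starting value is just the cross-entropy of a uniform guess over the 65-character vocabulary, which is easy to check:

```python
import math
print(math.log(65))   # ≈ 4.174: the loss when every character is equally likely
```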

Inference, seeds, and the leap to reasoning and multimodal models

For generation, he warns against greedy decoding because it makes LLMs dull, recommending temperature around 0.7 plus top-k sampling. In Q&A, he connects the tiny workshop model to bigger industry ideas: reasoning models often use the same base architecture but get their behavior from expensive high-quality post-training data, and audio/video models typically feed encoder-produced embeddings into the same transformer machinery rather than replacing it entirely.
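
A generation sketch along those lines, combining temperature scaling with top-k filtering instead of greedy argmax; only the 0.7 temperature comes from the talk, while top_k=50 and the helper's shape are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens=200, temperature=0.7, top_k=50, block_size=256):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits = model(idx_cond)[:, -1, :]           # logits for the last position
        logits = logits / temperature                # <1.0 sharpens, >1.0 flattens
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = -float("inf")  # drop everything outside the top k
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)       # append and continue
    return idx
```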
