What multi-head really means
When people say a model 'attends' to a word, it sounds like a metaphor. It isn't. Attention is a precise, repeatable step: every token in a sentence quietly reads every other token, scores how much each one matters, and keeps a blend of what it found. Do that in parallel, many times over, and you have a transformer.
Attention lets each word read every other word and keep what’s relevant. If a token is a three-quarter-word chunk, attention is how the chunks compare notes.
A query: what am I looking for? A key: what do I offer? A value: what do I hand over if I’m chosen?
Before any reading happens, each token’s embedding is pushed through three learned matrices, producing three vectors: a query, a key, and a value. The names are worth keeping. The query is what a word is looking for. The key is what a word advertises about itself. The value is the content it will pass along if another word decides to listen.
Picture a library where every book is also a reader. Each book holds up a label describing what it covers — that’s its key. Each book also walks the shelves with a question in mind — its query. A book reads the labels, finds the ones that answer its question, and copies down what those books contain — their values. No card catalogue, no librarian; just everything comparing itself to everything at once.
Because the queries, keys, and values all come from the same sentence, this is called self-attention: the words interrogate each other. The rest of the page is about how that comparison is scored and turned into an answer.
All of it fits in a single line of algebra. You don’t need to solve it — just read it left to right, and each piece will tell you what it does.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
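The same line can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's code: the function names, the sizes, and the random projection matrices `W_q`, `W_k`, `W_v` are assumptions chosen for the example, and in a trained model those matrices would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence.

    Q and K have shape (n_tokens, d_k); V has shape (n_tokens, d_v).
    Returns the blended values and the grid of attention weights.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # one score per (query, key) pair
    weights = softmax(scores, axis=-1)  # each row becomes a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8       # illustrative sizes (assumed)
X = rng.normal(size=(n_tokens, d_model))  # stand-in token embeddings

# Three projection matrices: random here, learned in a real model.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.round(2))  # row i shows where token i looks
```

Each row of `weights` is what the matrix below draws as one row of coloured cells.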
Here is one short sentence run through attention. Every row is a word; the colours along it show where that word looks and how strongly. Hover a word to light up its row — and switch heads to watch three different readers work the same line.
[Interactive figure: attention weights for one short sentence, one row per word, with three selectable heads. Brighter cells carry more weight; each row adds up to one.]
Previous-token head: tracks word order. Each word mostly looks one step back, the simplest thing a head can learn.
Example readout: “it” attends most to “because” (0.80).
Rather than run one big attention over the model’s full width, a transformer splits that width into several heads and runs them in parallel. The original model used 8 heads of 64 dimensions each — 8 × 64 = 512 — so the whole thing costs about the same as one full-width head, but each head is free to learn its own habit.
Each head attends on its own narrow slice, the slices are concatenated back together, and a final projection blends them into one result. Eight quiet specialists, then a single answer.
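A rough sketch of that split, attend, concatenate, project sequence, using the 8 × 64 = 512 shape quoted above. The function name, the six-token input, and the random weights are assumptions for illustration; real implementations add batching, masking, and biases.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)

    # Every head scores its own narrow slice, all heads in parallel.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (n_heads, n, d_head)

    # Concatenate the slices back to full width, then blend them.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 512, 8   # 8 heads of 64 dimensions each
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)
```

The four weight matrices are the same size they would be for a single full-width head, which is why the cost comes out about the same.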
Anthropic’s interpretability work splits a head into two parts: a query–key circuit that decides where to look, and an output–value circuit that decides what to copy once it gets there. The two are independent — a head can change its aim without changing its cargo.
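One way to check that split numerically, for a single head. The row-vector layout, the tiny sizes, and the names `W_QK` and `W_OV` are assumptions made for this sketch; `W_o` stands for just this head's slice of the output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 12, 3     # tiny illustrative sizes
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(d_head, d_model))  # this head's slice of the output projection

# The usual route: project to queries, keys, values, then attend.
scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_head)
A = softmax(scores)
head_out = A @ (X @ W_v) @ W_o

# The circuit view: two combined matrices, each d_model x d_model.
W_QK = W_q @ W_k.T   # decides where to look
W_OV = W_v @ W_o     # decides what gets copied
A_circuit = softmax(X @ W_QK @ X.T / np.sqrt(d_head))
head_out_circuit = A_circuit @ X @ W_OV

# The pattern A is computed from W_QK alone, so swapping W_v or W_o
# changes the cargo without moving the aim.
print(np.allclose(A, A_circuit), np.allclose(head_out, head_out_circuit))
```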
Probe a trained model and the habits are legible. Some heads track the previous word, some link a verb to its subject or an article to its noun, some resolve what a pronoun points back to. Many heads turn out to be redundant and can be pruned; the ones that survive are usually the ones with a clear job.
The most striking habit is the induction head. Having seen “Mrs Potter” once, it spots the next “Mrs” and bets the following word is “Potter” again. That copy-the-pattern reflex, built from a pair of heads working together, is now thought to do much of the heavy lifting behind in-context learning.
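The rule itself is simple enough to write out by hand. The sketch below only mimics the behaviour in plain Python; it says nothing about how a trained pair of heads implements it in weights, and `induction_guess` is a hypothetical helper named for this example.

```python
def induction_guess(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    current token and guess the token that followed it."""
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # nothing to copy: the token has not appeared before

context = ["Mrs", "Potter", "poured", "the", "tea", ".", "Mrs"]
print(induction_guess(context))  # Potter
```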
The three heads in the matrix above are toy versions of exactly this: a previous-token head, a grammar head, and a head that resolves “it” back to “cat”.
On its own, attention treats a sentence as a bag of words: shuffle them and every pair of words still gets the same score. So order has to be added by hand. The original transformer mixed in fixed sinusoids of different wavelengths; a learned version worked just about as well.
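Both halves of that can be checked in a short sketch. The sinusoid formula follows the original transformer's scheme; everything else is an assumption made to keep the example small: the sizes, the fixed shuffle `perm`, and a stripped-down `self_attention` that uses the embeddings directly as queries, keys, and values.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model, base=10000.0):
    """Fixed sinusoids: sin on even dimensions, cos on odd ones,
    with wavelengths that grow along the embedding (d_model must be even)."""
    pos = np.arange(n_positions)[:, None]    # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / base ** (i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def self_attention(X):
    # Simplified: no learned projections, so Q = K = V = X.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))              # six stand-in token embeddings
perm = np.array([3, 0, 5, 1, 4, 2])      # a fixed shuffle of the six words

# Without positions, shuffling the input just shuffles the output.
same = np.allclose(self_attention(X[perm]), self_attention(X)[perm])

# With positions mixed in, each word's result depends on where it sits.
pe = sinusoidal_positions(6, 8)
with_pos = np.allclose(self_attention(X[perm] + pe), self_attention(X + pe)[perm])

print(same, with_pos)
```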
Most models today use rotary embeddings, or RoPE, which rotate each query and key by an angle set by its position. The neat trick: after the rotation, a query and key’s dot product depends on position only through the offset between the two words, not on where they sit in the sentence.
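A minimal sketch of that property for a single query and key. The pairing of adjacent dimensions and the base of 10000 are common choices assumed here, not details taken from any particular model, and `rope` is a name picked for the example.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate each consecutive pair of dimensions of a 1-D vector x
    by an angle proportional to its position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

near_start = rope(q, 5) @ rope(k, 2)      # positions 5 and 2: offset 3
far_along = rope(q, 105) @ rope(k, 102)   # positions 105 and 102: offset 3
other_gap = rope(q, 5) @ rope(k, 4)       # positions 5 and 4: offset 1

# Same offset gives the same score wherever the pair sits;
# a different offset gives a different score.
print(np.isclose(near_start, far_along), np.isclose(near_start, other_gap))
```

Because each step is a pure rotation, the length of the query and key vectors is left unchanged.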
That all-pairs comparison has a cost. The score grid holds one cell for every pair of tokens, so it grows with the square of the sequence length: double the words and you roughly quadruple the work. This is the wall behind finite context windows.
Clever engineering softens it. FlashAttention computes exactly the same result but never writes the full grid to memory, which cuts memory use to linear in the sequence length. It buys room, not a different scaling law: the compute still grows with the square.
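The core idea can be sketched in NumPy: walk over the keys and values one block at a time while keeping a running maximum and a running sum, so the softmax is still exact. This is only an illustration of that streaming trick, not the FlashAttention kernel itself, which is a fused GPU implementation that also tiles the queries; the function names and the block size are assumptions.

```python
import numpy as np

def attention_full(Q, K, V):
    # Reference version: materialises the whole (n, n) score grid.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=4):
    """Same result, but only an (n_queries, block) slab of scores
    exists at any one moment."""
    n, d_k = Q.shape
    running_max = np.full((n, 1), -np.inf)
    running_sum = np.zeros((n, 1))
    acc = np.zeros((n, V.shape[-1]))
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d_k)                 # scores for this block only
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(running_max - new_max)       # rescale earlier totals
        p = np.exp(s - new_max)
        running_sum = running_sum * scale + p.sum(axis=-1, keepdims=True)
        acc = acc * scale + p @ Vb
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(attention_full(Q, K, V), attention_tiled(Q, K, V)))
```

The loop still touches every query and key pair once, so the amount of arithmetic is unchanged; only the peak memory shrinks.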