
How does attention actually work?

When people say a model 'attends' to a word, it sounds like a metaphor. It isn't. Attention is a precise, repeatable step: every token in a sentence quietly reads every other token, scores how much each one matters, and keeps a blend of what it found. Do that in parallel, many times over, and you have a transformer.

The short answer

Attention lets each word read every other word and keep what’s relevant. If a token is a chunk of roughly three-quarters of a word, attention is how the chunks compare notes.

Three vectors per word

A query: what am I looking for? A key: what do I offer? A value: what I hand over if I’m chosen.

In the original model

8 heads, 64 dimensions each

First, three vectors.

Before any reading happens, each token’s embedding is pushed through three learned matrices, producing three vectors: a query, a key, and a value. The names are worth keeping. The query is what a word is looking for. The key is what a word advertises about itself. The value is the content it will pass along if another word decides to listen.
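As a rough sketch of that projection step, here is what it might look like in NumPy. The sizes (4 tokens, width 8) and the random matrices are assumptions made for illustration; in a trained model the three matrices are learned, and the names `W_q`, `W_k`, `W_v` are just labels chosen here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_k = 4, 8, 8   # illustrative sizes, not from any real model

# One embedding row per token (random stand-ins for real embeddings).
X = rng.normal(size=(n_tokens, d_model))

# Three projection matrices: random here, learned in a trained model.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # queries: what each token is looking for
K = X @ W_k   # keys: what each token advertises about itself
V = X @ W_v   # values: what each token passes along if listened to

print(Q.shape, K.shape, V.shape)
```

Every token ends up with one row in each of `Q`, `K`, and `V`, all computed from the same embedding.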

Picture a library where every book is also a reader. Each book holds up a label describing what it covers — that’s its key. Each book also walks the shelves with a question in mind — its query. A book reads the labels, finds the ones that answer its question, and copies down what those books contain — their values. No card catalogue, no librarian; just everything comparing itself to everything at once.

Because the queries, keys, and values all come from the same sentence, this is called self-attention: the words interrogate each other. The rest of the page is about how that comparison is scored and turned into an answer.

§ I

The whole thing, in one line

For the curious

All of it fits in a single line of algebra. You don’t need to solve it — just read it left to right, and each piece will tell you what it does.
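For reference, the line in question is the scaled dot-product attention formula from Vaswani et al. (2017), written here in LaTeX with Q, K, V as the stacked query, key, and value vectors and dₖ as the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```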

QKᵀ: Every query, dotted with every key, giving a grid of raw match scores, one for each pair of words.
÷ √dₖ: Shrink those scores so the softmax does not saturate and its gradients stay usable. In the original model √dₖ is 8, since dₖ = 64.
softmax: Turn each row of scores into weights that are positive and sum to one.
× V: Use those weights to blend the value vectors into the word's new meaning.
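The four pieces map almost one-to-one onto code. The following is a minimal single-head sketch in NumPy, assuming random inputs with 5 tokens and dₖ = 64; the function names are chosen here for illustration and are not from any particular library.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T                 # one raw score per (query, key) pair
    scores = scores / np.sqrt(d_k)   # shrink by sqrt(d_k)
    weights = softmax(scores)        # each row is positive and sums to one
    return weights @ V, weights      # blend the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

The returned `weights` array is the score grid after the softmax: one row per query, one column per key.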

§ II

The attention matrix

Interactive

Here is one short sentence run through attention. Every row is a word; the colours along it show where that word looks and how strongly. Hover a word to light up its row — and switch heads to watch three different readers work the same line.

[Interactive heatmap: attention weights for the sentence “The cat sat on the mat because it was tired”. Each row shows where one word looks; brighter cells carry more weight, and each row adds up to one.]

Head shown: previous-token. It tracks word order: each word mostly looks one step back, the simplest thing a head can learn. Most rows put about 0.8 of their weight on the preceding word; for example, “it” attends most to “because” (0.80).

§ III

Many heads, many jobs

What the heads do

Rather than run one big attention over the model’s full width, a transformer splits that width into several heads and runs them in parallel. The original model used 8 heads of 64 dimensions each — 8 × 64 = 512 — so the whole thing costs about the same as one full-width head, but each head is free to learn its own habit.

Concat, then mix

What multi-head really means

Each head attends on its own narrow slice, the slices are concatenated back together, and a final projection blends them into one result. Eight quiet specialists, then a single answer.
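The split, attend, concatenate, project sequence can be sketched in a few lines of NumPy. This is an illustration under assumptions: the 512-wide, 8-head shape follows the original model described above, but the weights are random, the sequence length of 6 is arbitrary, and biases, masking, and dropout are left out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): one narrow slice per head
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    heads = softmax(scores) @ V                           # (n_heads, n, d_head)
    # Concatenate the heads back to full width, then mix with a final projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, n_heads, n = 512, 8, 6
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)
```

Each head only ever sees its own 64-wide slice of the queries, keys, and values; the heads meet again only at the final projection `W_o`.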

Where vs. what

Two circuits inside every head

Anthropic’s interpretability work splits a head into two parts: a query–key circuit that decides where to look, and an output–value circuit that decides what to copy once it gets there. The two are independent — a head can change its aim without changing its cargo.
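One way to see the split is to multiply the weight matrices together. The sketch below assumes a row-vector convention (tokens are rows of `X`) and random weights, and it ignores the softmax scaling; the names `W_qk` and `W_ov` are labels chosen here, loosely following the circuits framing rather than reproducing its exact notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head = 5, 16, 4   # illustrative sizes

X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(d_head, d_model))

# QK circuit: a single (d_model, d_model) matrix that scores token pairs.
W_qk = W_q @ W_k.T
# OV circuit: a single (d_model, d_model) matrix for what gets copied.
W_ov = W_v @ W_o

# Raw scores computed the usual way and through the combined QK matrix agree.
scores_usual = (X @ W_q) @ (X @ W_k).T
scores_qk = X @ W_qk @ X.T

# What each token would contribute if attended to, computed both ways.
moved_usual = (X @ W_v) @ W_o
moved_ov = X @ W_ov

print(np.allclose(scores_usual, scores_qk), np.allclose(moved_usual, moved_ov))
```

The scores never touch `W_v` or `W_o`, and the copied content never touches `W_q` or `W_k`, which is the sense in which aim and cargo are separate.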

Heads specialise

Grammar, position, reference

Probe a trained model and the habits are legible. Some heads track the previous word, some link a verb to its subject or an article to its noun, some resolve what a pronoun points back to. Many heads turn out to be redundant and can be pruned; the ones that survive are usually the ones with a clear job.

The pattern-finishers

Induction heads

The most striking habit is the induction head. Having seen “Mrs Potter” once, it spots the next “Mrs” and bets the following word is “Potter” again. That copy-the-pattern reflex, built from a pair of heads working together, is now thought to do much of the heavy lifting behind in-context learning.
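The behaviour, though not the mechanism, can be caricatured as a lookup rule. The function below is a hypothetical illustration written for this page: it is plain Python with no attention in it, and real induction heads implement something like this with learned weights rather than an explicit search.

```python
def induction_guess(tokens):
    """Guess the next token by copying what followed the most recent
    earlier occurrence of the current token. Returns None if no match."""
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

context = ["Mrs", "Potter", "opened", "the", "door", ".", "Mrs"]
print(induction_guess(context))  # prints "Potter"
```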

The three heads in the matrix above are toy versions of these specialists: a previous-token head, a grammar head, and a head that resolves “it” back to “cat”.

§ IV

Position, and the quadratic price

Two consequences

Attention is order-blind

On its own, attention treats a sentence as a bag of words — shuffle them and the scores come out the same. So order has to be added by hand. The original transformer mixed in fixed sinusoids of different wavelengths; a learned version worked just about as well.
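Both claims can be checked numerically. The sketch below assumes random embeddings and weights, a fixed shuffle, and the sinusoidal formula from Vaswani et al. (sine on even dimensions, cosine on odd ones); the helper names are chosen here for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(X, W_q, W_k):
    Q, K = X @ W_q, X @ W_k
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

def sinusoidal_positions(n, d_model):
    """Fixed sinusoids of different wavelengths, one row per position."""
    pos = np.arange(n)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (even_dims / d_model))
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
n, d = 6, 16
X = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)
perm = np.array([3, 0, 5, 1, 4, 2])   # a fixed shuffle of the six tokens

# No positions: shuffling tokens only shuffles rows and columns of the weights.
A = attention_weights(X, W_q, W_k)
A_shuffled = attention_weights(X[perm], W_q, W_k)
order_blind = np.allclose(A_shuffled, A[perm][:, perm])

# Positions added: the same pair of words now scores differently after a shuffle.
pe = sinusoidal_positions(n, d)
B = attention_weights(X + pe, W_q, W_k)
B_shuffled = attention_weights(X[perm] + pe, W_q, W_k)
order_aware = not np.allclose(B_shuffled, B[perm][:, perm])

print(order_blind, order_aware)
```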

Most models today use rotary embeddings, or RoPE, which rotate each query and key by an angle set by its position. The neat trick: after the rotation, a query and key’s dot product depends on position only through the offset between the two words, not on where the pair sits in the sentence.
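The relative-offset property is easy to check on a single query and key. This is a small sketch assuming the pairing used in the RoFormer paper (consecutive dimensions rotated together, frequencies falling off with base 10000); the function name `rope` and the vector size of 8 are choices made here.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by angles proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The same offset of 3, at two very different places in the sequence.
near = rope(q, 5) @ rope(k, 2)
far = rope(q, 105) @ rope(k, 102)
print(np.isclose(near, far))
```

Moving the pair a hundred positions along leaves the score unchanged; changing the gap between them does not.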

Everything attends to everything

That all-pairs comparison has a cost. The score grid holds one cell for every pair of tokens, so it grows with the square of the sequence length: double the words and you roughly quadruple the work. This is the wall behind finite context windows.
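The arithmetic is short enough to run. The helper below is a hypothetical illustration that counts cells in one head's score grid; the sequence lengths are arbitrary.

```python
def score_cells(n_tokens):
    """Number of (query, key) pairs in one head's score grid."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000):
    print(n, score_cells(n))
```

Each doubling of the length multiplies the cell count by four: one million, four million, sixteen million.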

Clever engineering softens it. FlashAttention computes the exact same result while never writing the full grid to memory, which cuts memory use to linear in the sequence length. It buys room, not a different scaling law: the number of score computations still grows with the square of the length.
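The idea that makes this possible is a running softmax: process the keys and values one block at a time and correct earlier partial sums as larger scores turn up. The NumPy sketch below illustrates that idea only. It is not FlashAttention itself, which is a GPU kernel that also tiles the queries and works in on-chip memory; the block size and function names are assumptions made here.

```python
import numpy as np

def attention_full(Q, K, V):
    """Reference version that builds the whole (n, n) score grid."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def attention_blockwise(Q, K, V, block=4):
    """Same result, visiting keys and values one block at a time."""
    n, d = Q.shape
    running_max = np.full((n, 1), -np.inf)
    running_sum = np.zeros((n, 1))
    acc = np.zeros((n, V.shape[-1]))
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                  # only an (n, block) slab
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        rescale = np.exp(running_max - new_max)    # correct earlier partial sums
        p = np.exp(s - new_max)
        running_sum = running_sum * rescale + p.sum(axis=-1, keepdims=True)
        acc = acc * rescale + p @ Vb
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 16)) for _ in range(3))
same = np.allclose(attention_full(Q, K, V), attention_blockwise(Q, K, V))
print(same)
```

The blockwise version still computes every query-key score, so the work is unchanged; only the amount held in memory at once shrinks.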

Sources & notes

Verified June 2026

Papers

The mechanism, the equation, multi-head attention, and the 512/8/64 hyperparameters are all from Vaswani et al., “Attention Is All You Need” (2017). The Annotated Transformer reproduces the equation as runnable code.
Rotary position embedding (RoPE) is from Su et al., “RoFormer” (2021). The linear-memory result is Dao et al., “FlashAttention” (2022).
Head specialisation and the QK/OV split come from Elhage et al., “A Mathematical Framework for Transformer Circuits” (Anthropic, 2021) and Clark et al., “What Does BERT Look At?” (2019). Induction heads are detailed in Olsson et al. (2022).

Notes

The weights in the interactive are illustrative: hand-built matrices that depict the documented head types above (previous-token, syntactic, coreference), not numbers pulled from one specific model. Each row is a genuine probability distribution that sums to one, so the heatmap behaves like real attention — but treat it as a diagram, not a measurement.
The 512-dimensional, 8-head figures describe the original 2017 transformer. Modern models vary widely; the shape of the mechanism is what carries over.
The example sentence and all prose here are written for Looking Glass. No third-party text is embedded on this page.
The quadratic-cost discussion connects directly to Concept 03, where the same pressure sets the size of the context window.
