Language models don't read words. They read tokens — roughly three-quarter-word chunks — and count them the way a meter counts water. One million is the number you'll hear most often. Here's what it actually holds.
One million tokens is about three-quarters of a million English words: roughly eight average novels, 2,700 paperback pages, 65 minutes of video, or 8.7 hours of audio.
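The back-of-envelope arithmetic behind those figures is worth making explicit. A minimal sketch, using the article's 0.75 words-per-token rule; the novel length (~90,000 words) and paperback page density (~280 words) are assumptions of mine, not from the text:

```python
# Convert 1M tokens into the article's equivalents.
TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75      # the article's rule of thumb
WORDS_PER_NOVEL = 90_000    # assumed average novel length
WORDS_PER_PAGE = 280        # assumed paperback word density

words = TOKENS * WORDS_PER_TOKEN     # 750,000 words
novels = words / WORDS_PER_NOVEL     # ≈ 8.3 novels
pages = words / WORDS_PER_PAGE       # ≈ 2,679 pages
print(f"{words:,.0f} words ≈ {novels:.1f} novels ≈ {pages:,.0f} pages")
```

Change the assumed novel length and the novel count moves with it, which is why "roughly eight" is as precise as the claim should get.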
A token is a fragment of text — sometimes a whole word, sometimes a piece of one. Models don’t read letters or even always words; they read these fragments, and count them relentlessly.
For English text, the rule of thumb is simple: one token is about four characters, or three-quarters of a word. A short word like “the” is one token. A long, unusual word like “tokenisation” might be three. Punctuation, spaces, and emojis all count too.
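The four-characters-per-token rule of thumb is easy to turn into a rough estimator. A sketch, not a real tokeniser; actual tokenisers split on learned subword boundaries and will disagree at the margins:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count for English text: ~4 characters per token.

    This is the article's rule of thumb, not a real tokeniser.
    """
    return max(1, round(len(text) / 4))

estimate_tokens("the")           # 1 token
estimate_tokens("tokenisation")  # 3 tokens
```

For anything that matters (billing, context limits), use the provider's own tokeniser rather than an estimate.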
Tokens are how a model measures its appetite, and how you get billed. When a provider advertises a “1M token context window,” or charges $3 per million tokens, that number is what they mean.
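Since prices are quoted per million tokens, the billing arithmetic is a single division. A sketch using the $3-per-million example rate from the text; real prices vary by model and often differ for input and output tokens:

```python
def cost_usd(tokens: int, usd_per_million: float = 3.0) -> float:
    """Cost of a request at a per-million-token price.

    $3/M is the example rate from the text; check your provider's
    pricing page for real input/output rates.
    """
    return tokens * usd_per_million / 1_000_000

cost_usd(1_000_000)  # 3.0  -- a full million-token context, once
cost_usd(250_000)    # 0.75 -- a quarter of the window
```

Note that the whole context is re-sent (and re-billed) on every turn of a conversation, so the per-request number understates the cost of a long session.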
≈ eight average novels, spine to spine
Stack eight ordinary paperbacks on a desk. That is the amount of prose a modern model can hold in its head at once, if you hand it every page.
Video and audio cost a different rate. Google’s Gemini models tokenise video at about 258 tokens per second by default, and audio at about 32 tokens per second. Other providers land in the same neighbourhood, with similar trade-offs between fidelity and cost.
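Those per-second rates make the media equivalents a one-line calculation each. A sketch using the Gemini default rates quoted above:

```python
# Default Gemini tokenisation rates from the text.
VIDEO_TOKENS_PER_SEC = 258
AUDIO_TOKENS_PER_SEC = 32

def media_seconds(budget_tokens: int, rate_per_sec: int) -> float:
    """How many seconds of media fit in a given token budget."""
    return budget_tokens / rate_per_sec

video_min = media_seconds(1_000_000, VIDEO_TOKENS_PER_SEC) / 60    # ≈ 64.6 minutes
audio_hr = media_seconds(1_000_000, AUDIO_TOKENS_PER_SEC) / 3600   # ≈ 8.7 hours
```

The eightfold gap between the two rates is the article's fidelity-versus-cost trade-off in miniature: video carries frames as well as sound, so each second costs far more of the window.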
Slide from a handful of tokens up to ten million. Every equivalent follows along.