Summary: The preprocessing step that splits raw text into discrete units (“tokens”) drawn from a fixed vocabulary, so that each unit can be mapped to a learned vector by the embedding matrix.
Why not just use words?
A word-level vocabulary is either too large (millions of distinct forms across a real corpus) or, if capped, too brittle (no way to represent typos, new coinages, rare proper nouns, other languages). Character-level is too fine: each token carries little meaning, so a fixed context window covers far less text. Subword tokenization splits the difference: common words get a single token; rare words decompose into meaningful subpieces.
Example from the 3b1b walkthrough:
"To date, the cleverest thinker of all time was"
→ ["To", " date", ",", " the", " cle", "ve", "rest", " thinker", " of", " all", " time", " was"]
Note that spaces and punctuation typically live inside tokens — “ date” is one token with the leading space, not two.
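To make this concrete, here is a minimal sketch using the tiktoken library with its GPT-3-era "r50k_base" encoding (both are my choices, not something the series names; the exact pieces may differ slightly from the illustration above, since different tokenizers split differently):

```python
# Minimal sketch (assumed: tiktoken installed, "r50k_base" = GPT-3-era encoding).
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

text = "To date, the cleverest thinker of all time was"
ids = enc.encode(text)

# Decode each id individually to see where the token boundaries fall.
pieces = [enc.decode([i]) for i in ids]
print(pieces)    # note leading spaces living inside tokens, e.g. ' date'
print(len(ids))  # length is measured in tokens, not words
```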
Key consequences
- Context size is measured in tokens, not words. GPT-3’s 2,048-token window is roughly 1,500 English words.
- Vocabulary size is fixed at training time: GPT-3 uses 50,257 tokens. The embedding matrix has one column per token, so growing the vocab grows the first and last layers linearly.
- Tokenization is lossless but not neutral. Different tokenizers produce different sequences for the same text, which means models with different tokenizers are not directly comparable token-for-token.
- Math, code, and non-English text tokenize less efficiently under tokenizers trained on mostly-English web text: more tokens per semantic unit, so effectively less usable context (see the counting sketch after this list).
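A rough counting sketch of the vocabulary-size and tokens-vs-words points, again assuming tiktoken and the "r50k_base" encoding (the sample sentences are arbitrary):

```python
# Rough sketch (assumed: tiktoken, "r50k_base"): fixed vocab size, and how
# non-English text and code spend more tokens per word than plain English.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
print(enc.n_vocab)  # fixed vocabulary size (50,257 for this encoding)

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "german":  "Der schnelle braune Fuchs springt über den faulen Hund.",
    "code":    "for (int i = 0; i < n; ++i) { total += values[i]; }",
}
for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    n_words = len(text.split())
    print(f"{name:8s} words={n_words:2d} tokens={n_tokens:2d} "
          f"tokens/word={n_tokens / n_words:.2f}")
```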
Algorithms (for later)
The 3b1b series doesn’t cover the algorithms themselves. The standard modern choices are Byte Pair Encoding (BPE) and the Unigram model (typically via SentencePiece). These are left as stubs to flesh out when a source covers them.
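Until then, a toy sketch of the core BPE training step, written from memory rather than any source covered so far: start from characters and repeatedly merge the most frequent adjacent pair into a new symbol.

```python
# Toy illustration of BPE training (my own sketch, not from the series):
# repeatedly merge the most frequent adjacent symbol pair into a new symbol.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny made-up corpus: each word starts as a tuple of characters, with a count.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(4):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", list(words))
```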