Summary: A non-technical overview of what a large language model (LLM) is, how it is trained, and why transformers changed the game.
Key ideas
- An LLM is a function that maps a sequence of text to a probability distribution over the next token (a token being roughly a word or word fragment). Generation is then repeated sampling: feed the running text back in, sample a token, append, repeat (see the first sketch after this list).
- The model itself is deterministic: a given prompt always yields the same distribution. The randomness comes from sampling, because drawing from the distribution (rather than always picking the argmax) lets plausible-but-not-top tokens through. That is why outputs look natural and why the same prompt gives different answers each run.
- A chatbot is this same machinery wrapped in framing: prepend a system-style preamble describing a helpful AI assistant, append the user's message, and run next-token prediction on the whole thing; the sampled continuation is the reply (also shown in the first sketch below).
- Scale. GPT-3 was trained on so much text that a human reading nonstop would need roughly 2,600 years to get through it. The model has on the order of 175 billion parameters, and performing the full training computation at one billion operations per second would take well over 100 million years.
- Training loop. Feed in all but the last token of an example; backpropagation then nudges every parameter so that the true last token becomes slightly more likely. Repeat over trillions of examples (a minimal worked version appears after this list).
- Two training phases: pretraining (next-token prediction over internet text) and RLHF, reinforcement learning with human feedback (human workers flag unhelpful or problematic responses, and the model is tuned to prefer the helpful kind; caricatured at the end of the training sketch below). The pretraining objective of autocompleting random internet text differs from the chatbot goal of being a good assistant, which is why the second phase is needed.
- Pre-2017 language models (RNN/LSTM-style) processed text one word at a time. The 2017 transformer processes all tokens in parallel, which lets training make full use of GPUs and is what unlocked today's scale.
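
A minimal sketch of the generation loop in Python. The "model" here is a hypothetical stand-in (deterministic pseudo-random scores over a toy vocabulary), since the point is the loop itself: softmax the scores into a distribution, sample, append, repeat. The `chat_reply` helper at the end shows the chatbot framing from the bullets above; the preamble wording and the whitespace "tokenizer" are invented for illustration, not any real system's.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def next_token_logits(tokens):
    """Stand-in for a trained LLM: one unnormalised score per vocabulary
    entry, deterministic for a given context, like the real function."""
    rng = random.Random(" ".join(tokens))    # seed from the running text
    return [rng.uniform(-2.0, 2.0) for _ in VOCAB]

def sample_next(tokens, temperature=1.0):
    logits = [l / temperature for l in next_token_logits(tokens)]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]        # softmax: scores -> distribution
    # Sampling (not argmax) is where run-to-run variation comes from.
    return random.choices(VOCAB, weights=probs, k=1)[0]

def generate(tokens, n_new):
    tokens = list(tokens)
    for _ in range(n_new):
        tokens.append(sample_next(tokens))   # feed back in, sample, append, repeat
    return tokens

# Chatbot framing: wrap the user's message in a helpful-assistant preamble
# (wording invented here) and let next-token prediction write the reply.
PREAMBLE = "what follows is a dialogue with a helpful AI assistant"

def chat_reply(user_message):
    prompt = (PREAMBLE + " user : " + user_message + " assistant :").split()
    return " ".join(generate(prompt, n_new=10)[len(prompt):])

print(" ".join(generate(["the", "cat"], n_new=8)))
print(chat_reply("hello there"))
```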
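And a minimal worked version of the training nudge, runnable without any ML framework. The "model" is the smallest next-token predictor that still works, a bigram table of logits (my stand-in, not how an LLM is built), so the cross-entropy gradient can be written by hand instead of via backpropagation. The toy data, learning rates, and the `feedback_step` caricature of RLHF (a crude sign-flipped update, nothing like the real preference-tuning pipeline) are all invented for illustration.

```python
import math
from collections import defaultdict

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
IDX = {w: i for i, w in enumerate(VOCAB)}

# Smallest possible "LLM": one row of logits per previous token (a bigram table).
logits = defaultdict(lambda: [0.0] * len(VOCAB))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(example, lr=0.5):
    """Feed in all but the last token; nudge parameters so the true last
    token becomes slightly more likely (one cross-entropy gradient step)."""
    *context, target = example
    row = logits[context[-1]]      # bigram: only the last context token is used
    probs = softmax(row)
    for i in range(len(VOCAB)):
        # d(cross-entropy)/d(logit_i) = p_i - 1[i == target]
        row[i] -= lr * (probs[i] - (1.0 if i == IDX[target] else 0.0))

def feedback_step(example, helpful, lr=0.1):
    """RLHF caricature: push a response's tokens up if a human marked it
    helpful, down if not. Real RLHF is a far more involved RL pipeline."""
    *context, target = example
    row = logits[context[-1]]
    probs = softmax(row)
    sign = 1.0 if helpful else -1.0
    for i in range(len(VOCAB)):
        row[i] -= sign * lr * (probs[i] - (1.0 if i == IDX[target] else 0.0))

data = [["the", "cat", "sat"], ["cat", "sat", "on"], ["sat", "on", "a"],
        ["on", "a", "mat"], ["a", "mat", "."]]
for _ in range(200):               # real pretraining repeats over trillions of examples
    for example in data:
        train_step(example)

feedback_step(["the", "cat", "sat"], helpful=True)   # phase two, in caricature
print(round(softmax(logits["cat"])[IDX["sat"]], 3))  # P("sat" | "cat") -> near 1
```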
On transformers (preview)
- Each token is associated with a long vector (its embedding); training shapes these vectors so that they encode the token's meaning.
- Attention lets these vectors communicate and refine one another based on context (e.g. the vector for “bank” near “river” shifts toward riverbank); a minimal attention computation is sketched after this list.
- An MLP block after attention provides extra capacity to store learned patterns/facts.
- Data flows through many repeated attention + MLP blocks. The vector at the final position is then passed through one last operation (the unembedding followed by a softmax) to produce the next-token distribution; the second sketch below strings the whole pipeline together.
- Model behaviour is emergent from the tuned parameters: researchers choose the framework (architecture and training objective), not the specific rules the model ends up using, which is why interpretability is hard.
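
To make the attention bullet concrete, here is a minimal single-head causal self-attention step in NumPy. Every weight matrix is a random stand-in, since the point is the data flow: each token's vector emits a query and a key, dot products between them (masked so no token looks ahead, then softmaxed) decide how strongly each earlier vector influences this one, and the weighted blend of value vectors is the refinement. The sizes and names are arbitrary sketch choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_head = 5, 16, 8            # sequence length, embedding dim, head dim

X = rng.normal(size=(T, d))        # one embedding vector per token

# Learned projections in a real model; random stand-ins here.
Wq = rng.normal(size=(d, d_head))  # query: "what am I looking for?"
Wk = rng.normal(size=(d, d_head))  # key:   "what do I contain?"
Wv = rng.normal(size=(d, d_head))  # value: "what do I pass along?"
Wo = rng.normal(size=(d_head, d))  # project the result back to embedding size

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_head)        # relevance of token j to token i
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                    # causal: no peeking at later tokens
    weights = softmax(scores)                 # each row: a distribution over context
    return weights @ V @ Wo                   # context-weighted blend, per token

refined = X + attention(X)                    # residual add: vectors get refined
print(refined.shape)                          # (5, 16)
```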
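And a skeleton of the whole pipeline, from token IDs to a next-token distribution. All weights are random, and it omits real-model pieces such as layer normalisation, multiple heads, and positional information, so it predicts nothing useful; it only shows the shape of the computation the bullets describe: embed, repeat attention + MLP blocks with residual additions, then unembed the final position's vector and softmax.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, d_head, d_mlp, n_blocks = 50, 16, 16, 64, 4  # toy sizes, chosen arbitrarily

E = rng.normal(size=(V, d)) * 0.1    # embedding matrix: one vector per vocab entry
U = rng.normal(size=(d, V)) * 0.1    # unembedding: final vector -> vocab scores

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def make_block():
    w = lambda *shape: rng.normal(size=shape) * 0.1
    return dict(Wq=w(d, d_head), Wk=w(d, d_head), Wv=w(d, d_head),
                Wo=w(d_head, d), W1=w(d, d_mlp), W2=w(d_mlp, d))

blocks = [make_block() for _ in range(n_blocks)]

def attend(X, p):                    # single-head causal self-attention
    Q, K, Vv = X @ p["Wq"], X @ p["Wk"], X @ p["Wv"]
    s = Q @ K.T / np.sqrt(d_head)
    s[np.triu(np.ones(s.shape, dtype=bool), k=1)] = -np.inf
    return softmax(s) @ Vv @ p["Wo"]

def mlp(X, p):                       # per-token extra capacity (patterns/facts)
    return np.maximum(0.0, X @ p["W1"]) @ p["W2"]

def next_token_distribution(token_ids):
    X = E[token_ids]                 # look up one embedding per token
    for p in blocks:                 # many repeated attention + MLP blocks
        X = X + attend(X, p)         # vectors exchange context...
        X = X + mlp(X, p)            # ...then are processed position by position
    return softmax(X[-1] @ U)        # unembed the final vector, softmax to probs

probs = next_token_distribution([3, 14, 15, 9])
print(probs.shape, float(probs.sum()))  # (50,) ~1.0
```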