Summary: A non-technical overview of what a large language model (LLM) is, how it is trained, and why transformers changed the game.
Key ideas
- An LLM is a function that maps a sequence of text to a probability distribution over the next token (a token being roughly a word or word fragment). Generation is then repeated sampling: feed the running text back in, sample a token, append, repeat (see the first sketch after this list).
- The model itself is deterministic: a given prompt always yields the same distribution. The randomness comes from sampling, because drawing from the distribution (rather than always picking the argmax) lets plausible-but-not-top tokens through. That is why outputs look natural and why the same prompt gives different answers each run.
- A chatbot is this same machinery wrapped in framing: prepend a system-style preamble describing a helpful AI assistant, append the user's message, and run next-token prediction on the whole thing; the sampled continuation is the reply (also shown in the first sketch below).
- Scale. GPT-3 was trained on so much text that a human reading nonstop would need roughly 2,600 years to get through it. The model has on the order of 175 billion parameters, and performing the full training computation at one billion operations per second would take well over 100 million years.
- Training loop. Feed in all but the last token of an example; backpropagation then nudges every parameter so that the true last token becomes slightly more likely. Repeat over trillions of examples (a minimal worked version appears after this list).
- Two training phases: pretraining (next-token prediction over internet text) and RLHF, reinforcement learning with human feedback (human workers flag unhelpful or problematic responses, and the model is tuned to prefer the helpful kind; caricatured at the end of the training sketch below). The pretraining objective of autocompleting random internet text differs from the chatbot goal of being a good assistant, which is why the second phase is needed.
- Pre-2017 language models (RNN/LSTM-style) processed text one word at a time. The 2017 transformer processes all tokens in parallel, which lets training make full use of GPUs and is what unlocked today's scale.
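
A minimal sketch of the generation loop in Python. The "model" here is a hypothetical stand-in (deterministic pseudo-random scores over a toy vocabulary), since the point is the loop itself: softmax the scores into a distribution, sample, append, repeat. The `chat_reply` helper at the end shows the chatbot framing from the bullets above; the preamble wording and the whitespace "tokenizer" are invented for illustration, not any real system's.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def next_token_logits(tokens):
    """Stand-in for a trained LLM: one unnormalised score per vocabulary
    entry, deterministic for a given context, like the real function."""
    rng = random.Random(" ".join(tokens))    # seed from the running text
    return [rng.uniform(-2.0, 2.0) for _ in VOCAB]

def sample_next(tokens, temperature=1.0):
    logits = [l / temperature for l in next_token_logits(tokens)]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]        # softmax: scores -> distribution
    # Sampling (not argmax) is where run-to-run variation comes from.
    return random.choices(VOCAB, weights=probs, k=1)[0]

def generate(tokens, n_new):
    tokens = list(tokens)
    for _ in range(n_new):
        tokens.append(sample_next(tokens))   # feed back in, sample, append, repeat
    return tokens

# Chatbot framing: wrap the user's message in a helpful-assistant preamble
# (wording invented here) and let next-token prediction write the reply.
PREAMBLE = "what follows is a dialogue with a helpful AI assistant"

def chat_reply(user_message):
    prompt = (PREAMBLE + " user : " + user_message + " assistant :").split()
    return " ".join(generate(prompt, n_new=10)[len(prompt):])

print(" ".join(generate(["the", "cat"], n_new=8)))
print(chat_reply("hello there"))
```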
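And a minimal worked version of the training nudge, runnable without any ML framework. The "model" is the smallest next-token predictor that still works, a bigram table of logits (my stand-in, not how an LLM is built), so the cross-entropy gradient can be written by hand instead of via backpropagation. The toy data, learning rates, and the `feedback_step` caricature of RLHF (a crude sign-flipped update, nothing like the real preference-tuning pipeline) are all invented for illustration.

```python
import math
from collections import defaultdict

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
IDX = {w: i for i, w in enumerate(VOCAB)}

# Smallest possible "LLM": one row of logits per previous token (a bigram table).
logits = defaultdict(lambda: [0.0] * len(VOCAB))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(example, lr=0.5):
    """Feed in all but the last token; nudge parameters so the true last
    token becomes slightly more likely (one cross-entropy gradient step)."""
    *context, target = example
    row = logits[context[-1]]      # bigram: only the last context token is used
    probs = softmax(row)
    for i in range(len(VOCAB)):
        # d(cross-entropy)/d(logit_i) = p_i - 1[i == target]
        row[i] -= lr * (probs[i] - (1.0 if i == IDX[target] else 0.0))

def feedback_step(example, helpful, lr=0.1):
    """RLHF caricature: push a response's tokens up if a human marked it
    helpful, down if not. Real RLHF is a far more involved RL pipeline."""
    *context, target = example
    row = logits[context[-1]]
    probs = softmax(row)
    sign = 1.0 if helpful else -1.0
    for i in range(len(VOCAB)):
        row[i] -= sign * lr * (probs[i] - (1.0 if i == IDX[target] else 0.0))

data = [["the", "cat", "sat"], ["cat", "sat", "on"], ["sat", "on", "a"],
        ["on", "a", "mat"], ["a", "mat", "."]]
for _ in range(200):               # real pretraining repeats over trillions of examples
    for example in data:
        train_step(example)

feedback_step(["the", "cat", "sat"], helpful=True)   # phase two, in caricature
print(round(softmax(logits["cat"])[IDX["sat"]], 3))  # P("sat" | "cat") -> near 1
```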
On transformers (preview)
- Each token is associated with a long vector (its embedding); training shapes these vectors so that they encode the token's meaning.
- Attention lets these vectors communicate and refine one another based on context (e.g. the vector for “bank” near “river” shifts toward riverbank); a minimal attention computation is sketched after this list.
- An MLP block after attention provides extra capacity to store learned patterns/facts.
- Data flows through many repeated attention + MLP blocks. The vector at the final position is then passed through one last operation (the unembedding followed by a softmax) to produce the next-token distribution; the second sketch below strings the whole pipeline together.
- Model behaviour is emergent from the tuned parameters: researchers choose the framework (architecture and training objective), not the specific rules the model ends up using, which is why interpretability is hard.
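
To make the attention bullet concrete, here is a minimal single-head causal self-attention step in NumPy. Every weight matrix is a random stand-in, since the point is the data flow: each token's vector emits a query and a key, dot products between them (masked so no token looks ahead, then softmaxed) decide how strongly each earlier vector influences this one, and the weighted blend of value vectors is the refinement. The sizes and names are arbitrary sketch choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_head = 5, 16, 8            # sequence length, embedding dim, head dim

X = rng.normal(size=(T, d))        # one embedding vector per token

# Learned projections in a real model; random stand-ins here.
Wq = rng.normal(size=(d, d_head))  # query: "what am I looking for?"
Wk = rng.normal(size=(d, d_head))  # key:   "what do I contain?"
Wv = rng.normal(size=(d, d_head))  # value: "what do I pass along?"
Wo = rng.normal(size=(d_head, d))  # project the result back to embedding size

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_head)        # relevance of token j to token i
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                    # causal: no peeking at later tokens
    weights = softmax(scores)                 # each row: a distribution over context
    return weights @ V @ Wo                   # context-weighted blend, per token

refined = X + attention(X)                    # residual add: vectors get refined
print(refined.shape)                          # (5, 16)
```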
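And a skeleton of the whole pipeline, from token IDs to a next-token distribution. All weights are random, and it omits real-model pieces such as layer normalisation, multiple heads, and positional information, so it predicts nothing useful; it only shows the shape of the computation the bullets describe: embed, repeat attention + MLP blocks with residual additions, then unembed the final position's vector and softmax.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, d_head, d_mlp, n_blocks = 50, 16, 16, 64, 4  # toy sizes, chosen arbitrarily

E = rng.normal(size=(V, d)) * 0.1    # embedding matrix: one vector per vocab entry
U = rng.normal(size=(d, V)) * 0.1    # unembedding: final vector -> vocab scores

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def make_block():
    w = lambda *shape: rng.normal(size=shape) * 0.1
    return dict(Wq=w(d, d_head), Wk=w(d, d_head), Wv=w(d, d_head),
                Wo=w(d_head, d), W1=w(d, d_mlp), W2=w(d_mlp, d))

blocks = [make_block() for _ in range(n_blocks)]

def attend(X, p):                    # single-head causal self-attention
    Q, K, Vv = X @ p["Wq"], X @ p["Wk"], X @ p["Wv"]
    s = Q @ K.T / np.sqrt(d_head)
    s[np.triu(np.ones(s.shape, dtype=bool), k=1)] = -np.inf
    return softmax(s) @ Vv @ p["Wo"]

def mlp(X, p):                       # per-token extra capacity (patterns/facts)
    return np.maximum(0.0, X @ p["W1"]) @ p["W2"]

def next_token_distribution(token_ids):
    X = E[token_ids]                 # look up one embedding per token
    for p in blocks:                 # many repeated attention + MLP blocks
        X = X + attend(X, p)         # vectors exchange context...
        X = X + mlp(X, p)            # ...then are processed position by position
    return softmax(X[-1] @ U)        # unembed the final vector, softmax to probs

probs = next_token_distribution([3, 14, 15, 9])
print(probs.shape, float(probs.sum()))  # (50,) ~1.0
```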