Summary: A neural network — in practice always a transformer — trained to assign a probability distribution over the next token given a prefix of text, and used generatively by sampling from that distribution repeatedly.
The core function
An LLM is a function that takes a sequence of up to a fixed number of tokens (the context window; 2,048 for GPT-3) and returns a probability distribution over a vocabulary of ~50K tokens.
Generation is not a new capability — it’s the same function called in a loop: sample a token from the returned distribution, append it to the prefix, and call the function again. The model itself is deterministic; the randomness lives in the sampling step. This is why the same prompt yields different outputs across runs.
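A minimal sketch of that loop, assuming a hypothetical `model(tokens)` that returns the next-token distribution as a numpy array (tokenisation and the context-window cut-off are elided):

```python
import numpy as np

def generate(model, prompt_tokens, n_new, rng=None):
    """Autoregressive generation: one next-token function, called in a loop."""
    rng = rng or np.random.default_rng()
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        probs = model(tokens)                         # deterministic given the prefix
        next_token = rng.choice(len(probs), p=probs)  # the only source of randomness
        tokens.append(int(next_token))
    return tokens
```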
Chatbots are just LLMs prompted with a hidden preamble describing a helpful AI assistant, followed by the user’s turn, with the model’s continuation presented as the assistant’s reply.
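Concretely, and purely as an illustration (the preamble wording and the `tokenize` helper are made up here), the chat framing is just string assembly in front of the same `generate` loop:

```python
# Hypothetical preamble; real systems use their own (hidden) wording.
PREAMBLE = ("What follows is a conversation between a user and a helpful, "
            "knowledgeable AI assistant.\n")

def chat_prompt(user_message: str) -> str:
    # The model simply continues this document; its continuation is
    # presented to the user as "the assistant's reply".
    return f"{PREAMBLE}User: {user_message}\nAssistant:"

# reply_tokens = generate(model, tokenize(chat_prompt("Why is the sky blue?")), 100)
```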
Why sample at all?
Picking the argmax at every step produces text that is bland and locally repetitive. Sampling with some randomness (often controlled by a softmax temperature) gives more natural, diverse outputs — at the cost of occasional nonsense.
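A sketch of temperature sampling, assuming the model exposes raw scores (logits) before the softmax, as is standard:

```python
import numpy as np

def sample(logits, temperature=1.0, rng=None):
    """temperature -> 0 recovers argmax; higher values flatten the distribution."""
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))       # greedy: bland, repetitive
    scaled = logits / temperature           # divide logits before the softmax
    scaled = scaled - scaled.max()          # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))
```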
Scale is the story
- Parameters. GPT-3 has 175B; frontier models are substantially larger. Parameters are also called weights — they are the real-valued knobs that backprop tunes. GPT-3’s 175B fall into ~28,000 weight matrices across 8 categories: embedding, unembedding, and, per layer, the query/key/value/output attention matrices and the MLP up/down projections (reproduced in the sketch after this list).
- Data. GPT-3 was trained on roughly 2,600 person-years of reading. Later models train on much more.
- Compute. Training GPT-3 took on the order of 3×10^23 floating-point operations; at a billion operations per second that alone would take roughly ten million years. Only made feasible by parallel hardware (GPUs/TPUs) and the transformer’s parallelism-friendly structure.
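The parameter and compute figures can be checked with back-of-the-envelope arithmetic from GPT-3’s published dimensions (d_model = 12288, 96 layers, vocabulary 50,257, MLP width 4 × d_model); biases and layer norms are omitted:

```python
d_model, n_layers, vocab = 12288, 96, 50_257
d_mlp = 4 * d_model

embedding   = vocab * d_model                  # token embedding
unembedding = d_model * vocab                  # projects back to vocab logits
attention   = n_layers * 4 * d_model ** 2      # query, key, value, output per layer
mlp         = n_layers * 2 * d_model * d_mlp   # up- and down-projection per layer

print(f"{embedding + unembedding + attention + mlp:,}")   # 175,181,291,520 (~175B)

flops = 3.14e23                                           # reported GPT-3 training compute
print(f"{flops / 1e9 / (60 * 60 * 24 * 365):.1e} years")  # ~1.0e7 at 10^9 ops/sec
```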
Empirically, capability improves remarkably smoothly as parameters, data, and compute grow together. This is the backdrop for scaling laws and the recent capability boom.
Training in two phases
- pretraining — self-supervised next-token prediction on massive internet text. Cheap per example (every position in a sequence is a training signal; see the loss sketch below) but expensive in aggregate. Produces a model that is good at continuing arbitrary text, not necessarily at being helpful.
- rlhf — reinforcement learning from human feedback. Humans rate or rank outputs; the model is further tuned to prefer outputs humans like. This is what bends a next-token predictor into an assistant.
The goals of the two phases are genuinely different: “continue this Reddit thread” ≠ “answer this user’s question helpfully and safely.” Both phases matter.
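A sketch of the pretraining loss, with a hypothetical `model_probs(tokens)` returning one next-token distribution per position, shape `(len(tokens), vocab)`, from a single forward pass; note that a length-T sequence supplies T − 1 training targets:

```python
import numpy as np

def next_token_loss(model_probs, tokens):
    """Average cross-entropy of predicting tokens[t + 1] from tokens[: t + 1]."""
    probs = model_probs(tokens)           # (T, vocab), one distribution per prefix
    targets = tokens[1:]                  # every position predicts its successor
    picked = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked))
```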
Why transformers
Pre-2017 language models (RNNs, LSTMs) processed text one token at a time, which caps how much GPU parallelism they can exploit. Transformers (Vaswani et al., 2017) process a whole context in parallel via attention, so training throughput scales with available hardware instead of being throttled by sequential dependencies along the sequence. This parallelism, not any specific inductive bias, is the main reason transformers won.
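A toy contrast of the two compute patterns (shapes only; real attention adds learned projections and a causal mask):

```python
import numpy as np

T, d = 2048, 512
x = np.random.randn(T, d)               # one context's worth of token vectors
W = np.random.randn(d, d) / np.sqrt(d)

# RNN-style: inherently sequential; step t cannot start before step t - 1.
h = np.zeros(d)
for t in range(T):
    h = np.tanh(W @ x[t] + h)

# Attention-style: all T positions interact at once, as dense matrix products,
# exactly the kind of work GPUs parallelise well.
scores = x @ x.T / np.sqrt(d)                               # (T, T)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                   # softmax rows
out = weights @ x                                           # (T, d)
```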
Interpretability is hard
Model behaviour is emergent from tuned weights. Researchers design the architecture; the specific “rules” the model uses are whatever minimises the training loss. This is why you cannot read off why a model made a prediction from its parameters directly — and why things like superposition and sparse autoencoders matter.
See also
- transformer-architecture — the specific neural network structure underneath
- attention-mechanism, multilayer-perceptron — the two core block types
- pretraining, rlhf — the two training phases
- gpt-3 — the running example across the 3b1b LLM series
- src-3b1b-llms-ch1-llms-briefly