Summary: The first and largest phase of training an LLM — self-supervised next-token prediction on massive amounts of internet text, producing a model that is good at continuing arbitrary text. Pretraining does not, by itself, produce a helpful chatbot; it produces a competent text completer that later phases (rlhf) bend into an assistant.

The objective

For every training example — anywhere from a handful of tokens to thousands — the model is shown all tokens up to position $t$ and asked to predict token $t+1$. The loss is the cross-entropy between the true next token and the model’s predicted distribution:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T-1} \log p_\theta(x_{t+1} \mid x_1, \dots, x_t)$$

Backpropagation nudges the parameters $\theta$ so that the true next token gets a slightly higher probability and every other token gets a slightly lower one.

No human labels are needed — the training signal comes from the text itself. This is what makes it self-supervised and why it can scale to trillions of tokens cheaply.
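To make this concrete, here is a minimal PyTorch sketch of the training signal (the framework, the stand-in model, and all sizes are my illustrative assumptions, not anything from the source). The key line is the shift: the labels are just the input text moved one position to the left.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 50_000  # illustrative; real tokenizers vary

# Stand-in for a causal transformer: anything that maps token ids
# (batch, seq) -> logits (batch, seq, vocab) works for this sketch.
model = nn.Sequential(
    nn.Embedding(vocab_size, 128),
    nn.Linear(128, vocab_size),
)

# A batch of token ids, as if from tokenized web text (placeholder values).
tokens = torch.randint(0, vocab_size, (8, 512))

# Self-supervised labels: the target at position t is token t+1 of the
# same text, i.e. the input shifted left by one. No human labels needed.
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)  # (8, 511, vocab_size)

# Cross-entropy between each position's predicted distribution and the
# true next token, averaged over every position in every sequence.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # true next token nudged up, all others nudged down
```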

Why it works

Next-token prediction sounds trivial — autocomplete — but it’s secretly hard enough to force the model to learn almost everything about language:

  • Syntax (grammatical continuations are more likely).
  • Semantics (coherent continuations are more likely).
  • World knowledge (factually correct continuations are more likely, if facts are in the training data).
  • Reasoning (problems whose worked solutions appear in the corpus are more likely to be continued correctly).
  • Style, register, persona, code, math, and so on.

A model that’s genuinely good at predicting what comes next on arbitrary text has, implicitly, modelled most of what that text is about. This is why raw pretraining produces capabilities that far exceed “just autocomplete”.

Data, scale, and cost

  • Data. Internet-scale text: Common Crawl, Wikipedia, books, code, filtered and deduplicated. GPT-3 was trained on roughly 300B tokens; frontier models train on trillions.
  • Compute. The bulk of training compute for a model goes into pretraining. For GPT-3-scale, on the order of hundreds of GPU-years (roughly 3×10²³ FLOPs). For frontier models, orders of magnitude more.
  • Parallelism. Every position in every sequence contributes a training signal simultaneously — a length-$T$ sequence gives $T-1$ loss terms per forward pass. This efficiency is what makes pretraining feasible at all and depends on the causal mask in attention (see src-3b1b-llms-ch3-attention); a sketch of the mechanism follows this list.
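As a sketch of where that parallelism comes from (PyTorch again, with toy sizes of my choosing): the causal mask means position $t$ attends only to tokens $\le t$, so no position ever sees its own target, and the predictions at all positions are valid at once.

```python
import torch

T = 6  # toy sequence length

# Causal mask: position t may attend only to positions <= t, so the
# prediction made at t never peeks at token t+1, its own target.
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Toy attention scores for a single head; future positions get -inf
# before the softmax, which zeroes their attention weight.
scores = torch.randn(T, T)
weights = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)

assert torch.allclose(weights.tril(), weights)  # no weight on the future

# Because nothing leaks from the future, the logits at all T positions
# are usable simultaneously: a single forward pass yields T - 1 loss
# terms (positions 0..T-2 each predicting the token after them).
```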

The goal mismatch

Pretraining’s objective is “continue this text the way the internet would continue it”. That’s very different from “answer this user’s question helpfully and safely”. A raw pretrained model will:

  • Continue harmful or factually wrong content if the prompt looks like that kind of text.
  • Answer a question with another question, because one question followed by another is a common pattern on forums.
  • Drift into formats that statistically match the prompt even when they’re not useful.

Closing this gap is the job of rlhf and instruction tuning, which come after pretraining and use much less compute but much more targeted data.

See also