Summary: A fine-tuning phase applied to an already-pretrained LLM that uses human preference judgements as a training signal, bending the model from “good at continuing arbitrary internet text” toward “good at being a helpful, harmless assistant”.

Why it’s needed

Pretraining teaches the model to maximise the likelihood of the next token over a huge text corpus. This produces broad capabilities but the wrong objective: an internet-text predictor is not the same thing as a helpful assistant. Raw pretrained models refuse poorly, follow instructions inconsistently, and readily continue harmful content that was in their training distribution. RLHF closes that gap: it’s what turns GPT-3-the-completer into ChatGPT-the-assistant.

The 3b1b coverage is intentionally high-level: human annotators flag unhelpful or problematic outputs, and the model is tuned to be more likely to produce preferred outputs and less likely to produce bad ones.

How it actually works (one level more detail)

Though 3b1b doesn’t go into this, the standard recipe — and what most readers will eventually want to know — has three stages:

  1. Supervised fine-tuning (SFT). Human writers produce example dialogues showing what a good assistant response looks like. The pretrained model is fine-tuned on these via ordinary next-token prediction.
  2. Reward model training. Human labellers rank pairs (or larger sets) of model outputs by preference. A separate reward model is trained to predict which response a human would prefer, producing a scalar “how good is this response” function (sketched in code after this list).
  3. RL fine-tuning. The LLM is fine-tuned with reinforcement learning (typically PPO) to maximise the reward model’s score, with a KL penalty pulling it back toward the SFT model to prevent runaway drift.
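
For concreteness, here’s a minimal PyTorch-style sketch of stages 2 and 3 (stage 1 is just ordinary cross-entropy fine-tuning, so it’s omitted). The rm callable, the tensor shapes, and the beta value are illustrative assumptions, not any particular library’s API:

  import torch
  import torch.nn.functional as F

  def reward_model_loss(rm, chosen_ids, rejected_ids):
      # Stage 2: Bradley-Terry pairwise loss, -log sigmoid(r_chosen - r_rejected).
      # rm is any callable mapping a batch of token ids to one scalar per sequence.
      r_chosen = rm(chosen_ids)        # shape: (batch,)
      r_rejected = rm(rejected_ids)    # shape: (batch,)
      # Push the human-preferred response to out-score the rejected one.
      return -F.logsigmoid(r_chosen - r_rejected).mean()

  def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
      # Stage 3: the reward PPO actually maximises is the reward-model score
      # minus a KL penalty toward the frozen SFT model (beta is illustrative).
      # logp_* are per-token log-probs of the sampled response, shape (batch, seq).
      kl = (logp_policy - logp_sft).sum(dim=-1)    # crude per-sequence KL estimate
      return rm_score - beta * kl

  # Toy usage with stand-in tensors, just to check that the shapes line up.
  toy_rm = lambda ids: ids.float().mean(dim=-1)    # fake scalar "reward model"
  chosen, rejected = torch.randint(0, 100, (4, 16)), torch.randint(0, 100, (4, 16))
  print(reward_model_loss(toy_rm, chosen, rejected))
  print(kl_shaped_reward(torch.ones(4), torch.randn(4, 16), torch.randn(4, 16)))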

The pipeline above (SFT → RM → PPO) is the InstructGPT / original ChatGPT recipe. Newer approaches like DPO (Direct Preference Optimization) and RLAIF (Reinforcement Learning from AI Feedback) are simpler and increasingly mainstream; flag for a future ingest.
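
To make “simpler” concrete: DPO drops the separate reward model and the RL loop, training the policy directly on the same preference pairs against a frozen reference model. A hedged sketch in the same PyTorch style (the beta value and the sequence-level log-prob inputs are assumptions of the illustration, not a fixed API):

  import torch.nn.functional as F

  def dpo_loss(logp_policy_chosen, logp_policy_rejected,
               logp_ref_chosen, logp_ref_rejected, beta=0.1):
      # All inputs are per-sequence summed log-probabilities of a response under
      # the trainable policy or the frozen reference (SFT) model, shape (batch,).
      chosen_ratio = logp_policy_chosen - logp_ref_chosen
      rejected_ratio = logp_policy_rejected - logp_ref_rejected
      # Push the policy to favour the chosen response relative to the reference,
      # more than it favours the rejected one -- no reward model, no PPO loop.
      return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

The log-ratios to the reference model play roughly the same role as stage 3’s explicit KL penalty, which is why DPO can skip it.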

What it changes

  • Instruction following. The model actually does what it’s asked instead of continuing the prompt as text.
  • Refusals. The model learns to decline certain requests rather than complete them.
  • Tone and format. Outputs become more uniform, more polite, more verbose — sometimes usefully, sometimes annoyingly (the “RLHF voice”).
  • Knowledge, not so much. Facts live in the pretrained weights; RLHF mostly reshapes the distribution over outputs, not the underlying knowledge.

Known issues

  • Reward hacking. The model can discover ways to score well under the reward model that don’t reflect genuine quality — verbose hedging, sycophantic agreement with the user, refusing safely instead of answering well.
  • Mode collapse. Post-RLHF models often sound more same-y and produce lower-diversity outputs than their pretrained base.
  • Compute-cheap, data-precious. RLHF is much cheaper than pretraining in compute, but much more expensive in the cost-per-label sense — high-quality human ranking data doesn’t scale the way raw text does.

See also