Covers neural networks, model architectures, training dynamics, inference, and AI safety/policy.
Foundations
- neural-network — Overview: nodes, edges, weights, activations; links to specific architectures
- multilayer-perceptron — Fully-connected feed-forward network (MLP); the simplest architecture
- backpropagation — Algorithm that computes the cost-function gradient, ∇C, via the chain rule
- computation-graph — DAG of mathematical operations recorded during the forward pass; the structure autograd traverses to compute gradients
- gradient-descent — Iterative optimisation: step in the negative-gradient direction, −∇C, to minimise cost (see the sketch after this list)
- cost-function — Scalar measure of how badly the network performs (e.g. mean squared error)
- activation-function — Nonlinear function (sigmoid, ReLU) applied to a layer’s weighted sum, giving the network its expressive power
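The Foundations entries fit together in a few lines of code. A minimal sketch, assuming a one-hidden-layer MLP with sigmoid activations and MSE cost trained on XOR; all sizes and names here are illustrative, not taken from any note:
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: XOR, the classic example a single linear layer cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# One hidden layer: a weight per arc, a bias per neuron.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 1.0  # learning rate / step size
for step in range(5000):
    # Forward pass: the computation graph autograd would record.
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    cost = np.mean((a2 - y) ** 2)          # MSE cost function

    # Backward pass: chain rule applied layer by layer (backpropagation).
    dz2 = 2 * (a2 - y) / len(X) * a2 * (1 - a2)
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # Gradient descent: step opposite the gradient, in the direction of −∇C.
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

print(f"final cost: {cost:.4f}")           # typically approaches 0
```
The backward pass walks the same computation graph as the forward pass, just in reverse order.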
Architectures
- transformer-architecture — Token embeddings flowing through interleaved attention + MLP blocks; the backbone of modern LLMs
- attention-mechanism — Q/K/V + softmax: how token embeddings share information based on context (sketched after this list)
- multi-head-attention — Many attention heads per block running in parallel, each learning a different pattern
- self-attention-vs-cross-attention — Two variants of an attention head differing in where Q, K, V come from
- architecture-bias-and-weight-conventions — How “weight per arc, bias per neuron” generalises across MLP, CNN, RNN, Transformer
- network-diagram-vs-computation-graph — Two graph-based views of the same neural network at different levels of abstraction
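A minimal sketch of a single self-attention head as described in the attention-mechanism entry, with toy dimensions standing in for real model sizes. In self-attention Q, K, and V all come from the same token sequence; cross-attention would take K and V from a different one:
```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 4         # illustrative sizes

E = rng.normal(size=(n_tokens, d_model))     # token embeddings
W_Q = rng.normal(size=(d_model, d_head))     # learned projections
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q, K, V = E @ W_Q, E @ W_K, E @ W_V          # queries, keys, values

scores = Q @ K.T / np.sqrt(d_head)           # scaled dot products
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[mask] = -np.inf                       # causal mask: no attending to later tokens

A = softmax(scores)                          # attention pattern; rows sum to 1
out = A @ V                                  # each token gets a context-weighted mix of values
print(A.round(2))
```
Multi-head attention runs several such heads in parallel on the same input and concatenates their outputs.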
Embedding & I/O
- one-hot-encoding — Binary vector representation of categorical indices; prevents false ordinal relationships and acts as a row-select on the weight matrix
- tokenization — Splitting text into subword tokens before embedding
- word-embedding — Learned vectors per token; directions encode meaning; dot product measures alignment
- unembedding — Final projection from residual stream to per-token scores
- logits — Unnormalised pre-softmax scores over the vocabulary
- softmax — Turns logits into a probability distribution; temperature controls sharpness (sketched after this list)
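A minimal end-to-end sketch of the I/O path above (one-hot row-select, embedding, unembedding, softmax with temperature), with toy sizes and random matrices standing in for learned weights:
```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8            # toy sizes

W_E = rng.normal(size=(vocab_size, d_model))   # embedding matrix
W_U = rng.normal(size=(d_model, vocab_size))   # unembedding matrix

token_id = 3
one_hot = np.eye(vocab_size)[token_id]

# A one-hot vector times W_E simply selects row `token_id` of W_E.
assert np.allclose(one_hot @ W_E, W_E[token_id])

residual = W_E[token_id]               # the embedding enters the residual stream
logits = residual @ W_U                # unembedding: one score per vocabulary token

for T in (0.5, 1.0, 2.0):
    p = softmax(logits, temperature=T)
    print(f"T={T}: top probability {p.max():.2f}")   # lower T -> sharper distribution
```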
Training & Optimisation
- pretraining — Self-supervised next-token prediction over massive text corpora (see the loss sketch after this list)
- rlhf — Reinforcement learning from human feedback: post-pretraining fine-tuning that uses human preferences to bend models toward assistant behaviour
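A sketch of the pretraining objective: average cross-entropy of each position's predicted distribution against the token that actually came next. The logits here are random stand-ins for a model's output:
```python
import numpy as np

def next_token_loss(logits, targets):
    # Average cross-entropy between each position's predicted
    # distribution and the actual next token.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab_size, seq_len = 50, 6
token_ids = rng.integers(0, vocab_size, size=seq_len + 1)

# Random stand-in for a model's output: one row of logits per position.
logits = rng.normal(size=(seq_len, vocab_size))

# Position t is scored on how well it predicted token t+1.
loss = next_token_loss(logits, token_ids[1:])
print(f"loss: {loss:.3f} (chance level is ln({vocab_size}) = {np.log(vocab_size):.3f})")
```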
Inference & Deployment
Language Models
- large-language-model — Top-level concept: what an LLM is, how sampling works, why scale matters (see the sampling loop after this list)
- gpt-3 — OpenAI’s 175B-parameter transformer; running example for all the parameter counts here
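How sampling works, in miniature: an autoregressive loop that feeds each sampled token back into the model. `fake_model` is a hypothetical stand-in for a real forward pass:
```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10

def fake_model(context):
    # Hypothetical stand-in for an LLM forward pass: anything that maps
    # a token sequence to next-token logits would slot in here.
    return np.roll(np.linspace(2.0, -2.0, vocab_size), sum(context) % vocab_size)

def sample_next(logits, temperature=1.0):
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Autoregressive loop: each sampled token is appended and fed back in.
context = [1]
for _ in range(8):
    context.append(sample_next(fake_model(context), temperature=0.8))
print(context)
```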
Interpretability
- superposition — Why features are spread across many neurons rather than one-per-neuron
- johnson-lindenstrauss-lemma — The math result explaining why high-dimensional spaces can pack many near-orthogonal directions (see the demo after this list)
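A quick numeric demonstration of the geometric intuition behind superposition: random unit vectors in a high-dimensional space end up very nearly orthogonal, so many more feature directions than neurons can coexist with little interference:
```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_vectors, dim):
    # Largest |cosine similarity| over all pairs of random unit vectors.
    V = rng.normal(size=(n_vectors, dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = V @ V.T
    np.fill_diagonal(G, 0.0)
    return np.abs(G).max()

# 1000 random directions stay nearly pairwise orthogonal once the
# dimension is large, even with far more directions than dimensions.
for dim in (10, 100, 1000):
    print(f"dim={dim}: max |cos| = {max_abs_cosine(1000, dim):.3f}")
```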
AI Safety & Alignment
Key Papers & Source Summaries
Neural Networks series (3b1b):
- src-3b1b-neural-networks-ch1 — Structure of a feed-forward neural network (MNIST)
- src-3b1b-neural-networks-ch2 — Learning via gradient descent on a cost function
- src-3b1b-neural-networks-ch3 — What the trained network actually learned (and didn’t)
- src-3b1b-neural-networks-ch4 — Backpropagation intuition: three levers, Hebbian echoes, SGD
- src-3b1b-neural-networks-ch5 — Backpropagation calculus: chain rule, multi-neuron generalisation
LLMs series (3b1b):
- src-3b1b-llms-ch1-llms-briefly — Non-technical overview of LLMs, pretraining, RLHF, transformers
- src-3b1b-llms-ch2-transformers — Transformer pipeline: embeddings, blocks, unembedding, softmax
- src-3b1b-llms-ch3-attention — Single-head attention (Q/K/V, masking) and multi-head attention
- src-3b1b-llms-ch4-mlps-store-facts — How MLP blocks might store facts; superposition
Queries
- backprop-graph-terminology — Why “children”, “upstream”, and “downstream” mean what they do in a backprop computation graph
Key Figures
- 3blue1brown — Grant Sanderson’s math/ML YouTube channel; source of the Neural Networks and LLMs series