Summary: JPEG compresses images by converting to a perceptual colour space, splitting into 8×8 blocks, applying the discrete cosine transform (DCT), discarding high-frequency DCT coefficients via quantization, and entropy coding the result.
Overview
JPEG (1992) exploits two properties of human vision:
- We are more sensitive to luminance (brightness/intensity) than chrominance (colour).
- We are more sensitive to low spatial frequencies (large shapes) than high frequencies (fine detail).
  - Spatial frequency refers to how quickly a region of an image changes in brightness or colour.
The algorithm is lossy: the original pixel values cannot be recovered exactly. Loss is concentrated in the quantization step (plus the optional chroma downsampling, which is also lossy); all other steps are lossless.
Encoder pipeline
Running dimensionality count (illustrative only)
- Imagine an input image with 100×100 pixels split across 3 channels: R, G, B.
- Matrix entry values for R, G, and B range between 0 and 255 (8 bits each).
- Total: 100×100 = 10,000 values per channel, 30,000 in total.
1. Colour space conversion: RGB → YCbCr
Separate the luminance (brightness) of an image from the chrominance (colour).
- Y = luminance (roughly perceived brightness). Coefficients match the eye’s spectral sensitivity (green dominates).
- Cb and Cr = blue-difference (i.e. blueness) and red-difference (i.e. redness) chroma channels
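The conversion can be sketched in Python. The coefficients below are the JFIF (full-range BT.601) matrix; Cb and Cr are offset by 128 so all three channels stay within 0–255:

```python
def rgb_to_ycbcr(r, g, b):
    """JFIF (full-range BT.601) RGB -> YCbCr for 8-bit values.

    Y weights green most heavily, matching the eye's spectral sensitivity.
    """
    y  =       0.299    * r + 0.587    * g + 0.114    * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5      * b
    cr = 128 + 0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

A neutral gray maps to Cb = Cr = 128 (no colour difference), which is why the level of 128 acts as "zero chroma".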
Images: Colour space conversion (RGB → YCbCr)
Note: To human eyes, changes in luminance are easy to perceive, but changes in either chrominance channel are hard to perceive.
Chroma downsampling (optional, also lossy!)
Human eyes are not very sensitive to colour (Cb and Cr), compared to brightness (Y), so downsample the colour information (i.e. reduce colour resolution):
- 4:4:4 - No downsampling.
- 4:2:2 - Downsample Cb and Cr by 2× horizontally only.
- 4:2:0 - Downsample Cb and Cr by 2× horizontally and 2× vertically (4× reduction in total).
  - Four pixels share one Cb and one Cr value.
  - This quarters the chroma data with minimal visible impact, halving the total data (~2× compression) before any DCT.
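A minimal 4:2:0-style downsample can be sketched as averaging each 2×2 chroma neighbourhood (averaging is one common choice; encoders may also use other filters or plain decimation). Assumes even dimensions:

```python
def downsample_420(chan):
    """Average each 2x2 block of a chroma channel.

    chan: list of equal-length rows (even width and height assumed).
    Returns a channel with half the width and half the height.
    """
    h, w = len(chan), len(chan[0])
    return [
        [(chan[y][x] + chan[y][x + 1] + chan[y + 1][x] + chan[y + 1][x + 1]) / 4
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]
```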
Images: Chrominance downsampling (4:2:0)
Note: Chrominance matrices are now a quarter of their original size. Luminance is unchanged.
Dimensionality count (illustrative only)
- After conversion: 100×100×3 — same shape, different semantics.
- After 4:2:0 subsampling, Cb and Cr are a quarter of their original sizes:
  - Luminance is unchanged: Y is 100×100 (i.e. 10k pixels),
  - But chrominance is reduced: Cb and Cr are each 50×50 (i.e. 2.5k pixels each).
- Total values: 10,000 + 2,500 + 2,500 = 15,000 — already half the original 30,000.
2. Block splitting and level shift
For each channel Y, Cb, and Cr:
- Block splitting: Divide the channel into non-overlapping 8×8 pixel blocks, or sub-images.
- Level shift: Subtract 128 from each value (shift unsigned [0, 255] → signed [−128, 127]).
  - This centres the data around zero for the DCT: −128 is black, +127 is white. Zero-centred data matches the cosine basis functions, which oscillate within [−1, 1].
If the image dimensions are not multiples of 8, pad the edges by repeating the boundary pixels (then discard the padding after decoding).
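The padding, level shift, and block split can be sketched in pure Python (a channel is a list of rows; `split_blocks` is a name chosen here, not from any library):

```python
def split_blocks(chan, n=8):
    """Pad a channel to multiples of n by repeating edge pixels,
    level-shift every value by -128, and split into n x n blocks.

    Returns a 2D grid of blocks: result[block_row][block_col] is an n x n matrix.
    """
    h, w = len(chan), len(chan[0])
    H = -(-h // n) * n  # ceil to next multiple of n
    W = -(-w // n) * n
    # Repeat boundary pixels into the padding, shifting [0,255] -> [-128,127]
    padded = [[chan[min(y, h - 1)][min(x, w - 1)] - 128 for x in range(W)]
              for y in range(H)]
    return [
        [[row[bx:bx + n] for row in padded[by:by + n]]
         for bx in range(0, W, n)]
        for by in range(0, H, n)
    ]
```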
Visualise matrices: Level shift on one sub-image (8×8 pixel block)
Running example
- 100 is not a multiple of 8, so pad each Y dimension to 104 (next multiple of 8); similarly pad the 50×50 chroma channels to 56×56.
- Y: 104×104; Cb and Cr: 56×56 after padding.
- Block counts: Y gives 13×13 = 169 blocks; Cb and Cr give 7×7 = 49 blocks each.
- Total: 169 + 49 + 49 = 267 blocks, each an 8×8 matrix.
3. 2D DCT per block
Analytic calculation
Apply the 2D DCT-II to each 8×8 block, individually for each channel Y, Cb, and Cr. Element-wise formula:

$$G(u,v) = \frac{1}{4}\,\alpha(u)\,\alpha(v)\sum_{x=0}^{7}\sum_{y=0}^{7} g(x,y)\,\cos\!\left[\frac{(2x+1)u\pi}{16}\right]\cos\!\left[\frac{(2y+1)v\pi}{16}\right], \qquad \alpha(k)=\begin{cases}1/\sqrt{2} & k=0\\ 1 & \text{otherwise}\end{cases}$$

- where:
  - g(x, y): pixel value at row x, column y in the level-shifted block
  - G(u, v): DCT coefficient at row frequency u, column frequency v
  - x: spatial row index (vertical), 0–7
  - y: spatial column index (horizontal), 0–7
  - u: vertical spatial frequency index, 0–7
  - v: horizontal spatial frequency index, 0–7
  - N = 8: block size
- This is the DCT-II with the orthonormal scaling used in JPEG (Wikipedia also lists the unnormalized form). The 64 pixel values become 64 DCT coefficients:
  - G(0, 0): the DC coefficient — equals 8 times the block mean; corresponds to the flat (constant) basis image
  - G(u, v) for (u, v) ≠ (0, 0): the 63 AC coefficients — capture spatial frequency content at increasing row and column frequencies
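A direct (slow, O(N⁴)) implementation of the formula above; for a flat block the only non-zero output is the DC coefficient, equal to 8 × the block mean:

```python
import math

def dct2(block):
    """Orthonormal 8x8 2D DCT-II (the scaling used by JPEG).

    block: 8x8 list of level-shifted pixel values.
    Returns the 8x8 matrix of DCT coefficients G(u, v).
    """
    n = 8
    alpha = lambda k: 1 / math.sqrt(2) if k == 0 else 1.0
    return [
        [0.25 * alpha(u) * alpha(v) * sum(
            block[x][y]
            * math.cos((2 * x + 1) * u * math.pi / 16)
            * math.cos((2 * y + 1) * v * math.pi / 16)
            for x in range(n) for y in range(n))
         for v in range(n)]
        for u in range(n)
    ]
```

Real encoders use fast factored DCTs (e.g. AAN), but the output is the same up to rounding.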
Visualise matrices: 2D DCT on one sub-image (8×8 pixel block)
Animation: 64 DCT basis images (8×8 px blocks), used to reconstruct a sub-image (8×8 px block)
- The 64 DCT basis images (each image is an 8×8 pixel block)
- Animation: Reconstructing a sub-image (8×8 pixel block) by combining a weighted sum of these basis images:
In practice: dot product with precomputed basis image blocks
- In practice, the 64 DCT coefficients are just 64 dot products. Each DCT coefficient G(u, v) is the dot product of the pixel block g with the corresponding DCT basis image B(u, v):
  - G(u, v) = ⟨g, B(u, v)⟩, where B(u, v) is an 8×8 array of cosine values that can be precomputed once.
- To compute, flatten the pixel block into a 64-element vector g, and each basis image likewise. Then each coefficient is a 64-element dot product.
- Each DCT coefficient answers: how much of basis image B(u, v) is needed to reconstruct this block g?
- Concatenate all 64 flattened basis images as rows of a 64×64 matrix B, and the entire DCT becomes one matrix multiply: G = B g.
- Because the basis images are orthonormal, B is orthogonal and the inverse DCT is simply g = Bᵀ G. In practice, B is precomputed once — no cosines at runtime.
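The matrix view can be sketched as follows: build the 64×64 basis matrix once, then the forward DCT is a matrix-vector product and the inverse is the product with the transpose:

```python
import math

def basis_matrix(n=8):
    """Rows are the n*n flattened orthonormal DCT basis images."""
    alpha = lambda k: 1 / math.sqrt(2) if k == 0 else 1.0
    B = []
    for u in range(n):
        for v in range(n):
            B.append([0.25 * alpha(u) * alpha(v)
                      * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                      * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                      for x in range(n) for y in range(n)])
    return B

def dct_flat(block_flat, B):
    """Forward DCT: one dot product per basis image (G = B g)."""
    return [sum(bi * xi for bi, xi in zip(row, block_flat)) for row in B]

def idct_flat(coeffs, B):
    """Inverse DCT: B is orthogonal, so the inverse is the transpose (g = B^T G)."""
    return [sum(B[k][i] * coeffs[k] for k in range(len(B)))
            for i in range(len(coeffs))]
```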
Intuition (for a smooth 8×8 pixel block)
- Consider a smooth, medium-gray block where every Y value is the same constant. After the mandatory level shift, all pixel values equal that constant minus 128.
- G(0, 0) is large — high energy, because the flat basis is all 1s (up to scaling), so every pixel contributes equally.
- G(7, 7) ≈ 0 — the high-frequency basis is a cosine product that oscillates rapidly between positive and negative values; for a uniform block these contributions cancel in the dot product.
Dimensionality count (illustrative only)
- Each 8×8 pixel block → 8×8 DCT coefficient matrix. Shape unchanged: 267 blocks, each 8×8.
- G(0, 0) (DC) = 8 × block mean (typically a large value like 500–900 for bright regions).
- G(7, 7) (highest-frequency AC) ≈ near zero for any smooth block.
4. Quantization — the lossy step
- Divide each DCT coefficient by a quantization value and round to the nearest integer: B(u, v) = round(G(u, v) / Q(u, v)).
- The JPEG standard defines a quantization table Q where Q(u, v) increases with frequency. High-frequency coefficients are divided by large numbers → they round to zero. The low-frequency DC coefficient has a small Q value (precise).
- There are separate tables for luminance (Y) and chrominance (Cb, Cr) — the chroma tables use larger quantization values, again exploiting lower sensitivity.
- JPEG quality factor (1–100): Scales the quantization table. At quality 50, the standard tables are used as-is.
- Higher quality: e.g. at quality 90, table entries are scaled down (smaller entries → finer quantization → less loss).
- Lower quality: e.g. at quality 10, table entries are scaled up (larger entries → more loss).
This division and rounding is the only place information is destroyed. Everything else in the pipeline is exactly reversible.
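A sketch of quantization using the standard luminance table from Annex K of the JPEG specification. The quality scaling below is the common IJG-style formula used by libjpeg-family encoders — an assumption about a typical encoder, not mandated by the standard:

```python
# Standard JPEG luminance quantization table (Annex K, quality 50)
Q50 = [
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
]

def scale_table(q_table, quality):
    """IJG-style scaling: quality 50 leaves the table unchanged, 100 -> all ones."""
    s = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [[max(1, (q * s + 50) // 100) for q in row] for row in q_table]

def quantize(G, q_table):
    """Divide each DCT coefficient by its table entry and round (the lossy step)."""
    return [[round(G[u][v] / q_table[u][v]) for v in range(8)] for u in range(8)]
```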
Visualise matrices: Quantize the luminance (Y) DCT coefficients for one sub-image (8×8 pixel block)
- Example: The DC coefficient is divided by the small top-left table entry, so it survives with high precision.
- Top-left quantized coefficients are much larger. Our eyes perceive low-frequency patterns well.
- Bottom-right quantized coefficients are smaller (mostly zero). Our eyes don’t really perceive high-frequency patterns anyway.
Dimensionality count (illustrative only)
- Each coefficient divided and rounded: 267 blocks, each 8×8.
- At quality 50, typically 40–55 of the 63 AC coefficients per sub-image (8×8 pixel block) round to zero — the zig-zag tail is a long run of 0s.
5. Serialization: Zig-zag scan
Reorder the 8×8 quantized coefficient block into a 1D array following a zig-zag path:
Image of serialization path
This orders coefficients from low to high 2D frequency. After quantization, the array typically ends with a long run of zeros (high-frequency coefficients that rounded to zero) — ideal for run-length encoding.
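The zig-zag order can be generated by walking the anti-diagonals of the 8×8 grid, reversing direction on alternate diagonals:

```python
def zigzag_order(n=8):
    """Return the 64 (row, col) index pairs in JPEG zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col indexes each anti-diagonal
        diag = [(i, s - i) for i in range(max(0, s - n + 1), min(s, n - 1) + 1)]
        # odd diagonals run top-right -> bottom-left, even ones the reverse
        order.extend(diag if s % 2 else reversed(diag))
    return order
```

Applying this order to a quantized block yields the DC-first, low-to-high-frequency 1D array the RLE step expects.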
Visualise: array of serialised DCT coefficients
- Quantized DCT coefficients, now serialized into a 1D array of length 64: DC first, then the 63 AC values from low to high frequency.
Dimensionality count (illustrative only)
- Each block → a length-64 integer vector. Order: DC first, then AC from low to high frequency.
- Typical vector for a smooth block: non-zero values cluster at the start, followed by a long tail of zeros.
- Total: 267 × 64 = 17,088 integers, before entropy coding.
6. DC coefficient: delta coding
The DC coefficient of each block (the block mean) changes slowly across the image. Rather than encoding it absolutely, encode the difference from the previous block’s DC: Δᵢ = DCᵢ − DCᵢ₋₁.
This reduces the number of bits needed for smooth images.
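DC delta coding is a one-liner in spirit:

```python
def delta_encode_dc(dcs):
    """Encode each quantized DC as the difference from the previous block's DC.

    The predictor starts at 0, so the first block's DC is sent as-is.
    """
    prev, out = 0, []
    for dc in dcs:
        out.append(dc - prev)
        prev = dc
    return out
```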
Visualise: DC delta coding for our running block
- The quantized DC coefficient of the current block is compared with that of the previous block in the same channel.
- The difference Δ = DC(current) − DC(previous) is encoded instead of the raw DC — smaller magnitude → fewer bits under Huffman.
- For the very first block in a channel, the previous DC is defined as 0, so the raw DC value is sent directly.
Dimensionality count (illustrative only)
- DC sequences (one value per block, per channel):
  - Y: 169 DC values → 169 deltas
  - Cb and Cr: 49 deltas each
- Deltas are small for smooth images, compressing well under Huffman.
7. AC coefficients: Run Length Encoding (RLE)
Encode the remaining 63 AC coefficients (after zig-zag) as pairs of the following 2 symbols:

| Symbol 1 | Symbol 2 |
|---|---|
| (RUNLENGTH, SIZE) | (AMPLITUDE) |

- x: a non-zero, quantized AC coefficient
- Symbol 1 (concatenated and Huffman-coded together):
  - RUNLENGTH: number of zeros preceding the current coefficient (0–15)
  - SIZE: number of bits required to represent x
  - Special symbols:
    - (0, 0) = end-of-block (EOB) — no AMPLITUDE follows; all remaining coefficients are zero
    - (15, 0) = zero-run-length marker (ZRL) — 16 consecutive zeros, no AMPLITUDE follows
- Symbol 2:
  - AMPLITUDE: the actual coefficient value (bit representation of x)
  - Appended as SIZE raw bits after the Huffman code; not Huffman-coded itself
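The AC run-length encoding, including the ZRL and EOB special symbols, can be sketched as follows (units are `(run, size, amplitude)` tuples; `None` marks symbols that carry no amplitude):

```python
def bit_size(v):
    """Number of bits in the JPEG SIZE category of a non-zero coefficient."""
    return abs(v).bit_length()

def rle_ac(ac):
    """Run-length encode 63 quantized AC coefficients (zig-zag order).

    Trailing zeros are dropped and replaced by the EOB symbol;
    every run of 16 zeros inside the data becomes a ZRL symbol.
    """
    last = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    out, run = [], 0
    for v in ac[:last + 1]:
        if v == 0:
            run += 1
            if run == 16:
                out.append((15, 0, None))  # ZRL: 16 consecutive zeros
                run = 0
        else:
            out.append((run, bit_size(v), v))
            run = 0
    out.append((0, 0, None))  # EOB terminates the block
    return out
```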
Visualise: AC RLE for our running block
AC values from the serialized zig-zag array (positions 1–63): encoding each non-zero coefficient x as (RUNLENGTH, SIZE)(AMPLITUDE), we get the following 20 units, which encode all 63 AC values.
- The (RUNLENGTH, SIZE) part is Huffman-coded;
- The AMPLITUDE is appended raw.

| (RUNLENGTH, SIZE) | AMPLITUDE | Note |
|---|---|---|
| (0, 2) | … | 0 preceding zeros; SIZE 2 covers the amplitude |
| (1, 2) | … | 1 zero before this value; SIZE 2 to store it |
| (0, 2) | … | |
| (0, 3) | … | SIZE 3 covers the amplitude |
| (0, 2) | … | |
| (0, 3) | … | |
| (0, 1) | … | SIZE 1 covers the amplitude |
| (0, 2) | … | |
| (0, 1) | … | |
| (0, 1) | … | |
| (0, 3) | … | |
| (0, 1) | … | |
| (0, 2) | … | |
| (0, 1) | … | |
| (0, 1) | … | |
| (0, 1) | … | |
| (0, 2) | … | |
| (5, 1) | … | 5 zeros before this value |
| (0, 1) | … | |
| (0, 0) | — | EOB: 38 trailing zeros collapsed to one symbol |
Dimensionality count (illustrative only)
- A block whose zig-zag AC sequence has only a few non-zero leading values and 60 trailing zeros encodes as:
  - a couple of (RUNLENGTH, SIZE)(AMPLITUDE) units plus EOB — 3 units instead of 63 values.
- The end-of-block symbol is the key saving: any block with a long zero tail gets a single terminator.
8. Huffman coding
Encode the (RUNLENGTH, SIZE) symbols (and the SIZE category of each DC delta) using Huffman codes. The JPEG standard defines default Huffman tables (alternatively, the encoder can derive optimal tables from the image — “optimised Huffman”). The Huffman coder is lossless: it assigns shorter bit sequences to more frequent symbols.
The output is a compressed bitstream with a JPEG File Interchange Format (JFIF/EXIF) header describing the tables used. JFIF is a wrapper holding the compressed data.
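A generic Huffman construction can be sketched with a heap. This is only illustrative: real JPEG caps code lengths at 16 bits, and most encoders simply use the standard’s default tables:

```python
import heapq

def huffman_code_lengths(freqs):
    """Compute Huffman code lengths from a {symbol: frequency} dict.

    Each merge of the two lightest subtrees adds one bit to every
    symbol inside them; frequent symbols end up with short codes.
    """
    heap = [(f, i, [sym]) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    tiebreak = len(heap)  # unique counter so tuples never compare lists
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths
```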
Dimensionality count — final tally
| Stage | Data size |
|---|---|
| Raw RGB pixels | 30,000 values × 8 bits = 240 kbits |
| After 4:2:0 | 15,000 values × 8 bits = 120 kbits |
| After DCT + quantization (quality 50) | non-zero coefficients × ~4 bits avg ≈ ~51 kbits |
| After Huffman | typically ~45–60 kbits for a photographic image |

Compression ratio: roughly 4:1 at quality 50 for this 100×100 example.
Decoder pipeline
Exactly reverse:
- Huffman decode → RLE symbols → AC + DC coefficients
- Zig-zag inverse → 8×8 coefficient matrix
- Dequantize: multiply each coefficient by Q(u, v)
- Inverse 2D DCT → pixel block
- Add 128 (undo level shift)
- Upsample chroma channels (bilinear or nearest)
- YCbCr → RGB
The dequantization step cannot recover the lost precision — if a coefficient rounded to 0, multiplying 0 back by Q(u, v) gives 0, not the original value.
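The irreversibility is easy to demonstrate with a single coefficient:

```python
def quantize_roundtrip(coeff, q):
    """Quantize then dequantize one DCT coefficient.

    Rounding discards precision; multiplying back by q cannot restore it.
    """
    return round(coeff / q) * q
```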
Compression artefacts
Blocking: At low quality, quantization introduces large differences between adjacent blocks. Since DCT is applied independently per block, there is no information across block boundaries → visible 8×8 grid pattern.
Ringing (Gibbs phenomenon): Near sharp edges, the DCT is being asked to represent a discontinuity with a truncated frequency series. This causes oscillation on both sides of the edge, similar to the Gibbs phenomenon in Fourier series.
Colour bleeding: Chroma subsampling plus low-quality chrominance DCT causes colour to bleed across sharp luminance edges.
Typical compression ratios
| Quality | Ratio | Use case |
|---|---|---|
| 95 | ~3:1 | Archival, print |
| 80 | ~8:1 | Web photos |
| 60 | ~15:1 | Thumbnails |
| 30 | ~30:1 | Preview images |
JPEG is poorly suited for graphics with sharp edges, text, or flat colour regions — PNG (lossless) is better there. JPEG excels at photographic images where high-frequency DCT coefficients are genuinely small.






