Discrete Cosine Transform (DCT)

Summary: The DCT-II transforms a finite sequence into a sum of cosines at different frequencies, producing real-valued output with strong energy compaction that makes it ideal for image and audio compression.

The problem with DFT for compression

The discrete Fourier transform (DFT) of a real sequence produces complex-valued output, so you need to store both real and imaginary parts (or magnitude and phase). More importantly, the DFT implicitly treats the sequence as periodic, so a discontinuity at the boundary (where the end of the block doesn’t match the start) creates high-frequency artefacts. For natural image blocks, the two ends rarely match, so this is almost always a problem.

The DCT avoids both issues.

DCT-II (the standard DCT)

There are eight DCT variants; “the DCT” in nearly all engineering contexts means DCT-II. Each (scalar) DCT coefficient $X[k]$ is the amplitude of the $k$-th frequency component. It is given by:

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right), \quad k = 0, 1, \dots, N - 1$$

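The $n + 1/2$ shift is the key detail: it samples each cosine at the midpoints of the $N$ intervals, which makes the implied extension of the signal even about both block edges. Unlike the DFT, there is no complex exponential; the basis functions are purely real cosines.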

In practice, these are computed via matrix multiplication

Computationally, each output coefficient $X[k]$ is a dot product of the input sequence $x$ with the $k$-th cosine basis vector $b_k$, where $b_k[n] = \cos\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right)$. Applying DCT-II to an $N$-point signal is therefore $N$ dot products, one per output frequency.
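The dot-product view can be sketched as a single matrix-vector product. This is a minimal illustration, assuming NumPy and the unnormalised DCT-II definition given above; the names `dct2_matrix` and `dct2_direct` are hypothetical helpers introduced here, not part of any library.

```python
import numpy as np

def dct2_matrix(N):
    """Return the N x N matrix whose k-th row is the cosine basis vector b_k."""
    n = np.arange(N)          # sample positions, one per column
    k = n.reshape(-1, 1)      # frequency indices, one per row
    return np.cos(np.pi / N * (n + 0.5) * k)

def dct2_direct(x):
    """Unnormalised DCT-II as N dot products (one matrix-vector product)."""
    x = np.asarray(x, dtype=float)
    return dct2_matrix(len(x)) @ x

x = np.array([1.0, 2.0, 3.0, 4.0])
X = dct2_direct(x)
print(X)
```

With this scaling, `X[0]` is simply the sum of the samples, because the $k = 0$ basis vector is all ones. The direct approach costs $O(N^2)$ operations, which is perfectly adequate for small fixed sizes such as $N = 8$.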

Example 1: 1D cosine addition (and 1D basis pattern)

Example 2: 1D cosine addition (and 1D basis pattern)

Why it’s real-valued: the even-extension trick

The DCT-II is equivalent to the discrete Fourier transform applied to an even-extended version of the signal. Extend $x[0], \dots, x[N-1]$ by reflecting it: form the $2N$-sample sequence $x[0], x[1], \dots, x[N-1], x[N-1], \dots, x[0]$. This extended signal is symmetric about the half-sample point between the two copies, so its DFT is real up to a known phase factor $e^{i \pi k / (2N)}$. The DCT is (up to that phase factor and scaling) just the first $N$ values of that DFT.

The symmetry also eliminates the boundary discontinuity problem: the reflected signal is continuous at the join (there is no jump in value), so the extension introduces no artificial step and far less spurious high-frequency content.
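The even-extension relationship can be checked numerically. The sketch below assumes NumPy, the unnormalised DCT-II definition, and NumPy's FFT sign convention ($e^{-2\pi i nk/M}$); under those assumptions the first $N$ DFT bins of the extended signal, multiplied by $e^{-i\pi k/(2N)}$, should be real and equal to $2X[k]$. The random test signal is arbitrary.

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(N)          # arbitrary real test signal

# DCT-II straight from the definition (unnormalised).
n = np.arange(N)
k = n.reshape(-1, 1)
X = np.cos(np.pi / N * (n + 0.5) * k) @ x

# DFT of the 2N-sample even extension x[0..N-1], x[N-1..0].
v = np.concatenate([x, x[::-1]])
V = np.fft.fft(v)[:N]

# Remove the half-sample phase; what remains should be real and equal 2 * X.
corrected = V * np.exp(-1j * np.pi * np.arange(N) / (2 * N))

print(np.max(np.abs(corrected.imag)))        # imaginary residue (rounding only)
print(np.max(np.abs(corrected.real - 2 * X)))  # mismatch against 2 * X
```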

The inverse: DCT-III

The inverse of DCT-II (up to scaling) is DCT-III. Each (scalar) sample value $x[n]$ (e.g. a single pixel intensity after the level shift) is given by:

$$x[n] = \frac{X[0]}{N} + \frac{2}{N} \sum_{k=1}^{N-1} X[k] \cos\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right)$$

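The DC coefficient ($k = 0$) is handled separately because it lacks the factor of 2: the $k = 0$ basis vector is constant, so its squared norm is $N$ rather than the $N/2$ of every other basis vector.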

“DC” and “AC” are borrowed from electrical engineering (direct current and alternating current), by way of Fourier analysis in signal processing:
DC = 0 Hz and AC = everything else

In practice, these are computed via matrix multiplication

Computationally, each output sample $x[n]$ is also a dot product: of the (suitably weighted) coefficient vector $X$ with the same cosine basis evaluated at position $n$. You’re not “inverting a dot product”; you’re doing $N$ new dot products in the other direction. This works because the cosine basis vectors are mutually orthogonal (and orthonormal once rescaled): projecting onto them and summing the weighted projections back reconstructs the original exactly.
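A round trip makes the orthogonality argument concrete. The following is a sketch under the same unnormalised convention as the formulas above, assuming NumPy; `dct2_matrix` and `idct2_direct` are illustrative names, and the sample values are arbitrary.

```python
import numpy as np

def dct2_matrix(N):
    """Rows are the cosine basis vectors b_k, k = 0..N-1."""
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.cos(np.pi / N * (n + 0.5) * k)

def idct2_direct(X):
    """DCT-III, weighted so that it inverts the unnormalised DCT-II."""
    X = np.asarray(X, dtype=float)
    N = len(X)
    weights = np.full(N, 2.0 / N)   # factor 2/N for every AC coefficient
    weights[0] = 1.0 / N            # the DC term lacks the factor of 2
    return dct2_matrix(N).T @ (weights * X)

x = np.array([52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0])
C = dct2_matrix(len(x))
x_rec = idct2_direct(C @ x)          # forward, then inverse

gram = C @ C.T                       # inner products between basis vectors
print(np.round(gram, 10))
print(np.max(np.abs(x_rec - x)))
```

The Gram matrix comes out diagonal, with $N$ in the DC position and $N/2$ elsewhere, which is exactly where the $1/N$ and $2/N$ weights in the inverse come from.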

Energy compaction

For natural signals (speech, images), the signal’s energy concentrates strongly in the low- $k$ DCT coefficients.

The $k = 0$ coefficient is proportional to the mean (DC level);
$k = 1, 2, \dots$ capture progressively higher spatial/temporal frequencies (AC levels).

Quantitatively: for a first-order Markov model of image rows (adjacent pixels are correlated with coefficient $\rho \approx 0.9$), the DCT is the asymptotically optimal decorrelating transform: it approaches the Karhunen-Loève transform (KLT) for large $N$, and also as $\rho \to 1$. In practice on 8×8 image blocks, roughly 95% of the energy often sits in the top-left 15–20 coefficients of the 64 available, though the exact share depends on image content.
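The Markov-model claim can be explored with a small numerical experiment. This sketch assumes NumPy, unit-variance samples with covariance $R[i, j] = \rho^{|i - j|}$, and an orthonormally scaled DCT-II matrix (rows rescaled by $\sqrt{1/N}$ and $\sqrt{2/N}$); it compares the off-diagonal energy of the covariance before and after the transform. It illustrates near-decorrelation for one choice of $N$ and $\rho$ and is not a proof of optimality.

```python
import numpy as np

N, rho = 8, 0.9
idx = np.arange(N)

# Covariance of a first-order Markov (AR(1)-like) row model with unit variance.
R = rho ** np.abs(idx[:, None] - idx[None, :])

# Orthonormal DCT-II matrix: cosine basis rows, rescaled to unit length.
C = np.cos(np.pi / N * (idx[None, :] + 0.5) * idx[:, None])
C[0] *= np.sqrt(1.0 / N)
C[1:] *= np.sqrt(2.0 / N)

T = C @ R @ C.T                      # covariance of the DCT coefficients
coeff_var = np.diag(T)               # variance (energy) per coefficient
dc_share = coeff_var[0] / coeff_var.sum()

# Off-diagonal energy measures the correlation that remains.
off_diag_before = np.sum(R ** 2) - np.sum(np.diag(R) ** 2)
off_diag_after = np.sum(T ** 2) - np.sum(np.diag(T) ** 2)

print("variance per coefficient:", np.round(coeff_var, 3))
print("share of variance in DC:", round(float(dc_share), 3))
print("off-diagonal energy before/after:", off_diag_before, off_diag_after)
```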

TODO: energy compaction plot — bar chart showing DCT coefficient magnitudes for a typical 8×8 image block; energy rapidly decays from k=0 to k=63

This makes truncation cheap: set all high- $k$ coefficients to zero, reconstruct with IDCT. The error (visible as ringing or blurring) is small because those coefficients were small to begin with.
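Truncation can be sketched in a few lines. The example assumes NumPy and an orthonormally scaled DCT (so that the inverse is just the transpose and energy is preserved); the test signal, the block length `N = 32`, and the cut-off `keep = 8` are hypothetical choices for illustration only.

```python
import numpy as np

def dct_matrix_orthonormal(N):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    n = np.arange(N)
    C = np.cos(np.pi / N * (n + 0.5) * n.reshape(-1, 1))
    C[0] *= np.sqrt(1.0 / N)
    C[1:] *= np.sqrt(2.0 / N)
    return C

N, keep = 32, 8
t = np.arange(N)
# Hypothetical smooth signal: offset + slow sinusoid + gentle ramp.
x = 100.0 + 20.0 * np.sin(2 * np.pi * t / N) + 0.5 * t

C = dct_matrix_orthonormal(N)
X = C @ x

X_trunc = X.copy()
X_trunc[keep:] = 0.0                 # zero all high-k coefficients
x_rec = C.T @ X_trunc                # reconstruct with the inverse DCT

error_energy = np.sum((x - x_rec) ** 2)
dropped_energy = np.sum(X[keep:] ** 2)
relative_error = error_energy / np.sum(x ** 2)

print(error_energy, dropped_energy, relative_error)
```

With orthonormal scaling, the squared reconstruction error equals the energy of the discarded coefficients, which is why dropping small coefficients yields a small error.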

2D DCT for images

The 2D DCT-II is separable: apply a 1D DCT to every row, then a 1D DCT to every column of the result:

$$X[k_r, k_c] = \sum_{r=0}^{N-1} \sum_{c=0}^{N-1} x[r, c] \cos\left(\frac{\pi \left(r + \frac{1}{2}\right) k_r}{N}\right) \cos\left(\frac{\pi \left(c + \frac{1}{2}\right) k_c}{N}\right)$$

1 DC component: $X[0, 0]$. A constant, non-oscillating signal (0 Hz) with no spatial variation.
- Basis image: a flat, uniform grey square.
- JPEG encoding of DC coefficients: delta (differential) encoding against the previous block’s DC value, followed by entropy coding.
The rest are AC components: $X[k_r, k_c]$ where $(k_r, k_c) \neq (0, 0)$. Spatial variation at increasing frequencies.
- Basis images: stripes, grids, and checkerboards of progressively finer detail.
- JPEG encoding of AC coefficients: zigzag scan, then run-length encoding (RLE) of the zero runs, followed by entropy coding.

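Cost: $O(N^2 \log N)$ for an $N \times N$ block using FFT-based 1D transforms along rows and columns, versus $O(N^3)$ for row-column evaluation with direct 1D DCTs and $O(N^4)$ for naive evaluation of the 2D double sum. For JPEG’s 8×8 blocks, $N = 8$ is fixed, so implementations typically use a pre-computed, hard-coded 8-point DCT kernel rather than a general-purpose FFT.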
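Separability means the 2D transform reduces to two matrix products. The sketch below assumes NumPy, a square block, and the unnormalised 2D DCT-II defined above; `dct2_2d` and `dct2_2d_naive` are illustrative names, and the random 8×8 block stands in for level-shifted pixel data.

```python
import numpy as np

def dct2_matrix(N):
    """Rows are the 1D cosine basis vectors b_k."""
    n = np.arange(N)
    return np.cos(np.pi / N * (n + 0.5) * n.reshape(-1, 1))

def dct2_2d(block):
    """Unnormalised 2D DCT-II: 1D DCT on every row, then on every column."""
    block = np.asarray(block, dtype=float)
    C = dct2_matrix(block.shape[0])
    rows_done = block @ C.T          # transforms index c into k_c
    return C @ rows_done             # transforms index r into k_r

def dct2_2d_naive(block):
    """Direct O(N^4) evaluation of the double sum, for comparison only."""
    N = block.shape[0]
    out = np.zeros((N, N))
    for kr in range(N):
        for kc in range(N):
            for r in range(N):
                for c in range(N):
                    out[kr, kc] += (block[r, c]
                                    * np.cos(np.pi * (r + 0.5) * kr / N)
                                    * np.cos(np.pi * (c + 0.5) * kc / N))
    return out

rng = np.random.default_rng(1)
block = rng.integers(0, 256, size=(8, 8)).astype(float) - 128.0  # level shift

X_fast = dct2_2d(block)
X_slow = dct2_2d_naive(block)
print(np.max(np.abs(X_fast - X_slow)))
```

The order of the two passes does not matter; columns first and rows second gives the same result because the two matrix products act on different indices.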

Capitalisation convention, $X [k]$ vs $x [n]$ , depends on domain of the output

The forward transform (DCT-II) goes $x \to X$ (spatial → frequency); the inverse (DCT-III) goes $X \to x$ (frequency → spatial).

This comes from signal processing. We follow the domain of the output: lowercase for the time/spatial domain ($x$), uppercase for the frequency domain ($X$).

The 64 basis images

The 2D DCT basis for 8×8 blocks is a set of 64 patterns — products of 1D cosine waves at different horizontal and vertical frequencies. The $(0, 0)$ basis is flat (DC). The $(7, 7)$ basis is a fine checkerboard (highest 2D frequency). Natural image blocks look like the low-frequency bases and have tiny projections onto the high-frequency ones.
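Each basis image is the outer product of two 1D basis vectors, which can be sketched as follows. This assumes NumPy and unnormalised cosine basis vectors; `basis_image` is a hypothetical helper name.

```python
import numpy as np

N = 8
n = np.arange(N)
# B[k] is the 1D cosine basis vector b_k.
B = np.cos(np.pi / N * (n + 0.5) * n.reshape(-1, 1))

def basis_image(kr, kc):
    """8x8 pattern for vertical frequency kr and horizontal frequency kc."""
    return np.outer(B[kr], B[kc])

flat = basis_image(0, 0)        # DC pattern: constant everywhere
checker = basis_image(7, 7)     # highest frequency in both directions

# All 64 patterns stacked into one array of shape (8, 8, 8, 8).
all_images = np.einsum("ar,bc->abrc", B, B)

print(np.sign(checker).astype(int))
```

Printing the signs of the $(7, 7)$ pattern shows the alternating checkerboard; the amplitudes are not uniform, since each entry is a product of two cosine samples.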

8×8 basis images grid (apply to each channel, $Y$ , $C_{r}$ , $C_{b}$ )

All 64 2D DCT basis patterns:

1 DC (top-left, flat grey)

63 AC patterns (oscillating stripes, grids, and checkerboards of increasing frequency)

Examine the 1D DCT patterns (i.e. first row or first column)

Note how they are just cosine waves of increasing frequency

Computing DCT via FFT

A length-$N$ DCT can be computed with one length-$2N$ fast Fourier transform (FFT), taking advantage of the even-extension relationship. This reduces the cost from $O(N^2)$ to $O(N \log N)$:

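import numpy as np

def dct2(x):
    """Unnormalised DCT-II of a real 1D array via one length-2N real FFT."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    v = np.concatenate([x, x[::-1]])   # even-extend to 2N samples
    V = np.fft.rfft(v)[:N]             # real FFT of extended signal; keep bins 0..N-1
    k = np.arange(N)
    # Phase correction for the half-sample shift; 0.5 matches the X[k] definition above
    return 0.5 * np.real(V * np.exp(-1j * np.pi * k / (2 * N)))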

Sources

No raw source files for this page — created as an implementation prerequisite.
