Summary: The DCT-II transforms a finite sequence into a sum of cosines at different frequencies, producing real-valued output with strong energy compaction that makes it ideal for image and audio compression.

The problem with DFT for compression

The discrete-fourier-transform of a real sequence produces complex-valued output — you need to store both real and imaginary parts (or magnitude and phase). More importantly, the DFT implicitly treats the sequence as periodic, so a discontinuity at the boundary (where the end of the block doesn’t match the start) creates high-frequency artefacts. For natural image blocks, this is almost always a problem.

The DCT avoids both issues.

DCT-II (the standard DCT)

There are eight DCT variants; “the DCT” in nearly all engineering contexts means DCT-II. Each DCT (scalar) coefficient, , is the amplitude of the -th frequency component. It is given by:

The shift is the key detail. Unlike the DFT, there is no complex exponential — the basis functions are purely real cosines.

In practice, these are computed via matrix multiplication

Computationally, each output coefficient is a dot product of the input sequence with the -th cosine basis vector , where . Applying DCT-II to an -point signal is therefore dot products, one per output frequency.

Why it’s real-valued: the even-extension trick

The DCT-II is equivalent to the discrete-fourier-transform applied to an even-extended version of the signal. Extend by reflecting it: form the -sample sequence . This extended signal is symmetric, so its DFT has no imaginary component. The DCT is (up to scaling) just the first values of that DFT.

The symmetry also eliminates the boundary discontinuity problem: the reflected signal is smooth at the join, so no artificial high-frequency content is introduced.

The inverse: DCT-III

The inverse of DCT-II (up to scaling) is DCT-III. Each (scalar) sample value, , (e.g. a single pixel intensity after level shift), is given by:

The DC coefficient () is handled separately because it lacks the factor of 2.

  • “DC” is borrowed from electrical engineering, referring to Fourier analysis in signal processing
  • DC = 0 Hz and AC = everything else

In practice, these are computed via matrix multiplication

Computationally, each output sample is also a dot product — of the coefficient vector with the same cosine basis evaluated at position . You’re not “inverting a dot product”; you’re doing new dot products in the other direction. This works because the cosine basis vectors are orthonormal: projecting onto them and summing them back reconstructs the original exactly.

Energy compaction

For natural signals (speech, images), the signal’s energy concentrates strongly in the low- DCT coefficients.

  • The coefficient is the mean (DC level);
  • capture progressively higher spatial/temporal frequencies (AC levels)

Quantitatively: for a first-order Markov model of image rows (adjacent pixels are correlated with coefficient ), the DCT is the asymptotically optimal decorrelating transform — it approaches the Karhunen-Loève transform for large . In practice on 8×8 image blocks, ~95% of the energy is in the top-left 15–20 coefficients of the 64 available.

TODO: energy compaction plot — bar chart showing DCT coefficient magnitudes for a typical 8×8 image block; energy rapidly decays from k=0 to k=63

This makes truncation cheap: set all high- coefficients to zero, reconstruct with IDCT. The error (visible as ringing or blurring) is small because those coefficients were small to begin with.

2D DCT for images

The 2D DCT-II is separable: apply a 1D DCT to every row, then a 1D DCT to every column of the result:

  • 1 DC component: . A constant, non-oscillating signal (0 Hz). No spatial variation,
    • Basis image: A flat, uniform, grey square.
    • JPEG encoding of DC coefs: Delta encoding
  • Rest are AC components: where . Spatial variation at increasing frequencies.
    • Basis images: Stripes, grids, and checkerboards of progressively finer detail.
    • JPEG encoding of AC coefs: Run Length Encoding (RLE)

Cost: using FFTs (vs. naive 2D). For JPEG’s 8×8 blocks, , so the fast algorithm uses a pre-computed 8-point DCT kernel.

Capitalisation convention, vs , depends on domain of the output

  • The forward transform (DCT-II) goes (spatial → frequency),
  • the inverse (DCT-III) goes (frequency → spatial).
  • This comes from signal processing. We follow the domain of the output:
    • lowercase for the time/spatial domain (),
    • uppercase for the frequency domain ().

The 64 basis images

The 2D DCT basis for 8×8 blocks is a set of 64 patterns — products of 1D cosine waves at different horizontal and vertical frequencies. The basis is flat (DC). The basis is a fine checkerboard (highest 2D frequency). Natural image blocks look like the low-frequency bases and have tiny projections onto the high-frequency ones.

8×8 basis images grid (apply to each channel, , , )

  • All 64 2D DCT basis patterns:
    • 1 DC (top-left, flat grey)
    • 63 high-frequency oscillating signals (AC) checkerboard
  • Examine the 1D DCT patterns (i.e. first row or first column)
    • Note how they are just cosine waves of increasing frequency

DCT

Computing DCT via FFT

A length- DCT can be computed with one length- fast-fourier-transform, taking advantage of the even-extension relationship. This reduces cost from to :

def dct2(x):
    N = len(x)
    v = np.concatenate([x, x[::-1]])   # even-extend
    V = np.fft.rfft(v)[:N]             # real FFT of extended signal
    k = np.arange(N)
    return np.real(V * np.exp(-1j * np.pi * k / (2 * N)))  # phase correction

Sources

No raw source files for this page — created as an implementation prerequisite.