Summary: DDR5’s 128-bit burst buffer implements 16n prefetch: the DRAM array fetches 128 bits per chip in one internal core cycle, then streams them out as 16 consecutive 8-bit transfers. Across the four ×8 chips of a 32-bit sub-channel, one burst delivers exactly one 64-byte x86 cache line, and the column multiplexer switches 16× less often than it would without the buffer.

The prefetch architecture

Prefetch depth (n) describes how many bits per data pin the DRAM fetches from the array in one internal access. A deeper prefetch decouples the internal array clock from the fast external interface: the array delivers n bits per pin at once, so it can run at a lower, more stable frequency while the external pins serialise those bits at full speed.

Generation | Prefetch | Internal core rate (relative to external data rate)
SDR | 1n | = external data rate
DDR | 2n | ½ × external data rate
DDR2 | 4n | ¼ × external data rate
DDR3/DDR4 | 8n | ⅛ × external data rate
DDR5 | 16n | ¹⁄₁₆ × external data rate

For DDR5-4800 (4,800 MT/s on a 2,400 MHz external clock, since DDR transfers data on both clock edges):

  • Internal array core runs at: 4,800 MT/s ÷ 16 = 300 MHz (equivalently, 2,400 MHz ÷ 8)
  • Each internal core cycle fetches 16 × 8 bits = 128 bits per chip

The burst buffer holds exactly this 128-bit fetch result, then serialises it onto the 8 data pins of the ×8 interface as 16 transfers, which takes 8 external clock cycles.
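The arithmetic above can be checked with a short Python sketch. The function names are hypothetical and the model is deliberately simple: it assumes the internal core rate is just the external data rate divided by the prefetch depth, as in the table.

```python
def core_clock_mhz(data_rate_mtps, prefetch):
    """Internal array rate (MHz) needed to sustain the external data rate.

    Assumes core rate = data rate / prefetch depth, the simple model
    used in the prefetch table.
    """
    return data_rate_mtps / prefetch


def burst_duration_ns(data_rate_mtps, burst_length):
    """Time the data pins are occupied by one burst, in nanoseconds."""
    # MT/s equals transfers per microsecond, so the quotient is in
    # microseconds; multiply by 1000 to get nanoseconds.
    return burst_length / data_rate_mtps * 1000.0


DDR5_4800 = 4800  # external data rate in MT/s

print(core_clock_mhz(DDR5_4800, 16))               # 300.0
print(f"{burst_duration_ns(DDR5_4800, 16):.2f} ns per BL16 burst")
```

Passing a prefetch of 8 with the same data rate gives 600 MHz, which illustrates why a deeper prefetch relaxes the array timing at a given external speed.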

TODO: Prefetch evolution diagram — show 1n through 16n. Each level: internal bus width doubles, external clock frequency stays fixed, internal core frequency halves. Illustrate how the prefetch buffer bridges internal and external domains.

The burst buffer

A 128-bit temporary register, the burst buffer, sits between the column multiplexer and each driver. There are two of them: one on the read path and one on the write path.

TODO: Circuit diagram — Address Input → Bank Group/Bank Control (×5) → Row Decoder (×16) → 65,536 rows → Sense Amplifiers (×8,192) → Column Multiplexer → Burst Buffer (read) + Burst Buffer (write) → Read Driver / Write Driver → ×8 data wires. Source:

Column address split (10 bits → 6 + 4)

Field | Bits | Range | Purpose
Multiplexer select | 6 | 0–63 | Selects 1 of 64 contiguous groups of 128 bitlines (64 × 128 = 8,192 bitlines total)
Burst position | 4 | 0–15 | Selects 1 of 16 eight-bit segments within the 128-bit burst buffer

The 128 selected bitlines must be contiguous — the multiplexer cannot select an arbitrary non-contiguous window.
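As a rough illustration of the 6 + 4 split, the sketch below decodes a 10-bit column address in Python. It assumes the multiplexer select occupies the upper six bits and the burst position the lower four; the text does not state the bit order, so treat that layout and the function names as assumptions.

```python
MUX_BITS = 6     # selects 1 of 64 blocks of 128 bitlines
BURST_BITS = 4   # selects 1 of 16 eight-bit segments in the buffer


def split_column_address(col):
    """Split a 10-bit column address into (mux_select, burst_position).

    Assumed layout: upper 6 bits = multiplexer select,
    lower 4 bits = burst position.
    """
    if not 0 <= col < (1 << (MUX_BITS + BURST_BITS)):
        raise ValueError("column address must fit in 10 bits")
    mux_select = col >> BURST_BITS
    burst_position = col & ((1 << BURST_BITS) - 1)
    return mux_select, burst_position


def bitline_range(mux_select):
    """Contiguous bitlines connected to the burst buffer for this select."""
    first = mux_select * 128
    return range(first, first + 128)


mux, pos = split_column_address(0b1010110011)
print(mux, pos)                      # 43 3
lines = bitline_range(mux)
print(lines[0], lines[-1])           # 5504 5631
```

Because each select value maps to one fixed block of 128 adjacent bitlines, the contiguity requirement falls out of the address arithmetic.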

How a burst read works

  1. 6-bit multiplexer select: Connects 128 contiguous bitlines to the burst buffer → loads 128 bits in one operation.
  2. 4-bit burst counter (0000 → 1111): Steps through the burst buffer 8 bits at a time → 16 consecutive 8-bit transfers to the read driver → out to the 8 data wires.

This is BL16 (burst length 16): one multiplexer command produces 16 data transfers = 128 bits total per chip.

Write works identically in reverse: the burst counter fills the 128-bit burst buffer from the write driver, then the multiplexer drives all 128 bits back to the selected bitlines simultaneously.
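The read and write sequences can be modelled in a few lines of Python. This is a behavioural sketch only: the open row is represented as a plain list of 8,192 bits standing in for the sense amplifiers, and the function names are hypothetical.

```python
ROW_BITS = 8192                          # bitlines in one open row
BLOCK_BITS = 128                         # width of the burst buffer
BEAT_BITS = 8                            # data pins on a x8 chip
BURST_LENGTH = BLOCK_BITS // BEAT_BITS   # 16 transfers per burst


def burst_read(row, mux_select):
    """Load one 128-bit block into the buffer, then emit it as 16 beats."""
    start = mux_select * BLOCK_BITS
    burst_buffer = row[start:start + BLOCK_BITS]   # step 1: one mux operation
    # Step 2: the burst counter walks the buffer 8 bits at a time.
    return [burst_buffer[i * BEAT_BITS:(i + 1) * BEAT_BITS]
            for i in range(BURST_LENGTH)]


def burst_write(row, mux_select, beats):
    """Collect 16 beats into the buffer, then drive all 128 bits at once."""
    burst_buffer = [bit for beat in beats for bit in beat]
    if len(burst_buffer) != BLOCK_BITS:
        raise ValueError("a full burst is 16 beats of 8 bits")
    start = mux_select * BLOCK_BITS
    row[start:start + BLOCK_BITS] = burst_buffer


row = [i % 2 for i in range(ROW_BITS)]
beats = burst_read(row, mux_select=5)
print(len(beats), len(beats[0]))                   # 16 8

burst_write(row, mux_select=5, beats=[[1] * BEAT_BITS] * BURST_LENGTH)
print(sum(row[5 * BLOCK_BITS:6 * BLOCK_BITS]))     # 128
```

The multiplexer is touched once per call in both directions. All per-beat work happens against the buffer, not the bitlines.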

Bandwidth improvement

Metric | Without burst buffer | With burst buffer
Multiplexer positions for 8,192 columns | 8,192 ÷ 8 = 1,024 | 8,192 ÷ 128 = 64
Transfers per multiplexer position | 1 × 8 bits | 16 × 8 bits
Multiplexer switching overhead | baseline | 16× lower
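The figures in this comparison follow from dividing the row width by the number of bits each multiplexer selection delivers. A minimal Python check, with hypothetical names:

```python
BITLINES = 8192   # columns in one row
PINS = 8          # data pins on a x8 chip


def mux_positions(bits_per_selection):
    """Distinct multiplexer settings needed to cover the whole row."""
    return BITLINES // bits_per_selection


without_buffer = mux_positions(PINS)        # one selection per 8-bit transfer
with_buffer = mux_positions(16 * PINS)      # one selection per 128-bit burst

print(without_buffer, with_buffer)          # 1024 64
print(without_buffer // with_buffer)        # 16
```

The 16× ratio is the burst length itself: every bit of prefetch depth removes one multiplexer switch per transfer.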

Cache line alignment

For a 32-bit DDR5 sub-channel (4 chips × 8 bits each):

  • Per burst: 128 bits per chip × 4 chips = 512 bits = 64 bytes
  • 64 bytes = one x86 cache line

A single BL16 burst fills exactly one CPU cache line. This is deliberate: the CPU moves cacheable memory in and out in whole cache-line units, so BL16 was chosen to match exactly, eliminating wasted partial transfers.
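The same calculation generalises to other channel shapes. The helper below is a hypothetical sketch; the DDR4 line is included only for comparison and uses the standard 64-bit channel with burst length 8.

```python
def bytes_per_burst(chips, pins_per_chip, burst_length):
    """Bytes delivered by one burst across all chips of a (sub-)channel."""
    bits = chips * pins_per_chip * burst_length
    return bits // 8


CACHE_LINE_BYTES = 64  # x86 cache line

# DDR5: 32-bit sub-channel built from four x8 chips, BL16.
ddr5 = bytes_per_burst(chips=4, pins_per_chip=8, burst_length=16)

# DDR4 for comparison: 64-bit channel built from eight x8 chips, BL8.
ddr4 = bytes_per_burst(chips=8, pins_per_chip=8, burst_length=8)

print(ddr5, ddr4)                   # 64 64
print(ddr5 == CACHE_LINE_BYTES)     # True
```

Halving the channel width while doubling the burst length keeps the transfer size at one cache line, which is what lets DDR5 run two independent sub-channels per module.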

Burst Chop (BC8)

DDR5 supports halving the burst to 8 transfers (BC8):

  • 64 bits per chip instead of 128 bits (32 bytes per sub-channel instead of 64)
  • Useful when interleaving read and write commands at fine granularity, or for access patterns that don’t align to 64-byte boundaries
  • Burst chop is selected when the read or write command is issued; it does not terminate a burst that is already in progress

Column-to-column timing (tCCD)

Consecutive burst commands must be spaced by at least tCCD to allow the burst buffer to reload and the data bus to settle:

Variant | When it applies | Typical DDR5
tCCD_S (short) | Two CAS commands to different bank groups | 8–12 cycles
tCCD_L (long) | Two CAS commands to the same bank group | 16–20 cycles

Different bank groups have independent I/O paths, allowing shorter tCCD between them. With 8 bank groups, up to 8 burst commands can be pipelined with tCCD_S gaps — substantially higher throughput than DDR4’s 4 bank groups.
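To see how bank-group interleaving affects command spacing, here is a toy in-order scheduler in Python. It models only the two tCCD constraints and ignores every other timing parameter. The function name is hypothetical, and the values 8 and 16 are illustrative picks from the ranges in the table, not figures for any specific speed grade.

```python
def schedule_cas(bank_groups, tccd_s, tccd_l):
    """Earliest issue cycle for each CAS command, issued in the given order.

    Only tCCD_S (different bank group) and tCCD_L (same bank group)
    are modelled. Assumes tccd_l >= tccd_s.
    """
    issue_cycles = []
    last_cas = None          # cycle of the most recent CAS to any group
    last_in_group = {}       # bank group -> cycle of its most recent CAS
    for group in bank_groups:
        earliest = 0
        if last_cas is not None:
            earliest = last_cas + tccd_s
        if group in last_in_group:
            earliest = max(earliest, last_in_group[group] + tccd_l)
        issue_cycles.append(earliest)
        last_cas = earliest
        last_in_group[group] = earliest
    return issue_cycles


# Four reads spread across four bank groups versus four to one group.
print(schedule_cas([0, 1, 2, 3], tccd_s=8, tccd_l=16))   # [0, 8, 16, 24]
print(schedule_cas([0, 0, 0, 0], tccd_s=8, tccd_l=16))   # [0, 16, 32, 48]
```

With these illustrative values, the interleaved stream issues a command every 8 cycles, which matches the 8 clock cycles a BL16 burst occupies on the data bus. The same-group stream leaves the bus idle half the time.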

Flexibility

The burst buffer does not force sequential access. If the next request targets:

  • A different 128-bit block in the same open row: the multiplexer loads a new block → a new burst can begin as soon as tCCD allows, with no precharge or activate
  • A different row in the same bank (row miss): full PRE + ACT + burst sequence
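The two cases can be expressed as a small decision function. This Python sketch uses hypothetical names and adds one case the list does not mention: a bank with no open row, which needs an ACT but no PRE.

```python
def commands_for_access(open_row, row):
    """Command sequence for a read to `row`, given the bank's open row.

    open_row is None when the bank is precharged (no row open); that
    case is an addition to the two listed above.
    """
    if open_row == row:
        return ["RD"]                  # row hit: only the mux block changes
    if open_row is None:
        return ["ACT", "RD"]           # idle bank: activate, then burst
    return ["PRE", "ACT", "RD"]        # row miss: close, open, then burst


open_row = None
for row in [10, 10, 11]:
    print(row, commands_for_access(open_row, row))
    open_row = row
# 10 ['ACT', 'RD']
# 10 ['RD']
# 11 ['PRE', 'ACT', 'RD']
```

The column address does not appear in the decision: any of the 64 blocks in an open row is reachable with a read command alone.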

See also