Summary: DDR5’s 128-bit burst buffer implements 16n prefetch: the DRAM array fetches 128 bits per chip in one internal core cycle, then streams them as 16 consecutive 8-bit transfers. Across the four ×8 chips of a 32-bit sub-channel, one burst delivers 64 bytes, exactly one x86 cache line, and the column multiplexer switches 16× less often than it would without the buffer.
The prefetch architecture
Prefetch depth (n) describes how many bits per data pin the DRAM fetches from the array in a single internal core cycle. Higher prefetch depth decouples the internal array clock from the fast external interface, letting the array run at a much lower frequency than the interface.
| Generation | Prefetch | Internal core rate (relative to external data rate) |
|---|---|---|
| SDR | 1n | = data rate |
| DDR | 2n | ½ × data rate |
| DDR2 | 4n | ¼ × data rate |
| DDR3/DDR4 | 8n | ⅛ × data rate |
| DDR5 | 16n | ¹⁄₁₆ × data rate |
For DDR5-4800 (4,800 MT/s on a 2,400 MHz external clock, since data moves on both clock edges):
- Internal array core runs at: 4,800 MT/s ÷ 16 = 300 MHz (equivalently 2,400 MHz ÷ 8)
- Each internal core cycle fetches 16 × 8 bits = 128 bits per chip
The burst buffer holds exactly this 128-bit fetch result, then serialises it onto the 8 data pins of the ×8 interface as 16 transfers. Because data moves on both clock edges, those 16 transfers take 8 external clock cycles.
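The arithmetic above can be checked with a minimal sketch. The function names `core_clock_mhz` and `bits_per_fetch` are hypothetical helpers introduced here for illustration, and the model assumes the simple relationship core rate = data rate ÷ prefetch depth.

```python
def core_clock_mhz(data_rate_mts: float, prefetch: int) -> float:
    """Internal array rate implied by an n-deep prefetch (simplified model)."""
    return data_rate_mts / prefetch


def bits_per_fetch(prefetch: int, data_pins: int) -> int:
    """Bits one chip pulls from the array in a single internal core cycle."""
    return prefetch * data_pins


# DDR5-4800, x8 chip, 16n prefetch
print(core_clock_mhz(4800, 16))  # 300.0
print(bits_per_fetch(16, 8))     # 128
```

The same helper reproduces the older generations under this simplified model: for example, `core_clock_mhz(3200, 8)` gives 400.0 for DDR4-3200.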
TODO: Prefetch evolution diagram — show 1n through 16n. Each level: internal bus width doubles, external clock frequency stays fixed, internal core frequency halves. Illustrate how the prefetch buffer bridges internal and external domains.
The burst buffer
A 128-bit temporary register, the burst buffer, sits between the column multiplexer and each driver, so there are two of them: one on the read path and one on the write path.
TODO: Circuit diagram — Address Input → Bank Group/Bank Control (×5) → Row Decoder (×16) → 65,536 rows → Sense Amplifiers (×8,192) → Column Multiplexer → Burst Buffer (read) + Burst Buffer (write) → Read Driver / Write Driver → ×8 data wires. Source:
Column address split (10 bits → 6 + 4)
| Field | Bits | Range | Purpose |
|---|---|---|---|
| Multiplexer select | 6 | 0–63 | Selects 1 of 64 contiguous groups of 128 bitlines (64 × 128 = 8,192 bitlines total) |
| Burst position | 4 | 0–15 | Selects 1 of 16 eight-bit segments within the 128-bit burst buffer |
The 128 selected bitlines must be contiguous — the multiplexer cannot select an arbitrary non-contiguous window.
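As a rough illustration of the split, the sketch below decodes a 10-bit column address into the two fields. It assumes the multiplexer select occupies the upper 6 bits and the burst position the lower 4; the text above does not state the bit ordering, so treat that layout (and the helper names) as hypothetical.

```python
MUX_BITS = 6         # selects 1 of 64 blocks of 128 bitlines
BURST_BITS = 4       # selects 1 of 16 eight-bit segments
BLOCK_BITLINES = 128


def split_column_address(col):
    """Split a 10-bit column address into (mux_select, burst_position).

    Assumed layout: upper 6 bits = mux select, lower 4 bits = burst position.
    """
    if not 0 <= col < (1 << (MUX_BITS + BURST_BITS)):
        raise ValueError("column address must fit in 10 bits")
    mux_select = col >> BURST_BITS
    burst_position = col & ((1 << BURST_BITS) - 1)
    return mux_select, burst_position


def bitline_range(mux_select):
    """First and last bitline index of the contiguous block a mux value selects."""
    start = mux_select * BLOCK_BITLINES
    return start, start + BLOCK_BITLINES - 1


print(split_column_address(37))  # (2, 5)
print(bitline_range(63))         # (8064, 8191)
```

The contiguity constraint shows up in `bitline_range`: every multiplexer value maps to one fixed, aligned window of 128 bitlines.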
How a burst read works
- 6-bit multiplexer select: Connects 128 contiguous bitlines to the burst buffer → loads 128 bits in one operation.
- 4-bit burst counter (0000 → 1111): Steps through the burst buffer 8 bits at a time → 16 consecutive 8-bit transfers to the read driver → out to the 8 data wires.
This is BL16 (burst length 16): one multiplexer command produces 16 data transfers = 128 bits total per chip.
Write works identically in reverse: the burst counter fills the 128-bit burst buffer from the write driver, then the multiplexer drives all 128 bits back to the selected bitlines simultaneously.
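The read and write paths described above can be modelled in a few lines. This is a behavioural sketch only: an open row is represented as a Python list of 8,192 bits, and `burst_read` and `burst_write` are hypothetical names, not part of any real DRAM interface.

```python
ROW_BITS = 8192      # bitlines in the open row
BLOCK_BITS = 128     # width of the burst buffer
BEAT_BITS = 8        # data pins on a x8 chip
BURST_LENGTH = 16    # BL16


def burst_read(row, mux_select):
    """Load one 128-bit block, then emit it as 16 beats of 8 bits."""
    start = mux_select * BLOCK_BITS
    burst_buffer = row[start:start + BLOCK_BITS]   # single multiplexer operation
    beats = []
    for counter in range(BURST_LENGTH):            # 4-bit burst counter, 0..15
        lo = counter * BEAT_BITS
        beats.append(burst_buffer[lo:lo + BEAT_BITS])
    return beats


def burst_write(row, mux_select, beats):
    """Fill the buffer beat by beat, then drive all 128 bits to the row at once."""
    burst_buffer = []
    for beat in beats:
        burst_buffer.extend(beat)
    start = mux_select * BLOCK_BITS
    row[start:start + BLOCK_BITS] = burst_buffer


row = [(i // 8) % 2 for i in range(ROW_BITS)]
beats = burst_read(row, 3)
print(len(beats), len(beats[0]))  # 16 8
```

Writing the beats back with `burst_write` into an all-zero row changes only bitlines 384 to 511, the block selected by `mux_select = 3`.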
Bandwidth improvement
| | Without burst buffer | With burst buffer |
|---|---|---|
| Multiplexer positions for 8,192 columns | 8,192 ÷ 8 = 1,024 | 8,192 ÷ 128 = 64 |
| Transfers per multiplexer position | 1 × 8 bits | 16 × 8 bits |
| Multiplexer switching overhead | baseline | 16× lower |
Cache line alignment
For a 32-bit DDR5 sub-channel (4 chips × 8 bits each):
- Per burst: 128 bits per chip × 4 chips = 512 bits = 64 bytes
- 64 bytes = one x86 cache line
A single BL16 burst fills exactly one CPU cache line. This is deliberate: the CPU always requests and evicts memory in cache-line units, so BL16 was chosen to match exactly, eliminating wasted partial transfers.
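The alignment is easy to verify numerically. The helper below is a hypothetical name used only for this check; it multiplies chips, pins per chip, and burst length, then converts bits to bytes.

```python
CACHE_LINE_BYTES = 64  # x86 cache line size


def bytes_per_burst(chips, pins_per_chip, burst_length):
    """Bytes delivered by one burst across every chip in a sub-channel."""
    bits = chips * pins_per_chip * burst_length
    return bits // 8


# 32-bit DDR5 sub-channel: 4 chips x 8 pins, BL16
print(bytes_per_burst(4, 8, 16))                      # 64
print(bytes_per_burst(4, 8, 16) == CACHE_LINE_BYTES)  # True
```

Passing a burst length of 8 instead of 16 yields 32 bytes, which is half a cache line.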
Burst Chop (BC8)
DDR5 supports halving the burst to 8 transfers (BC8):
- 64 bits per chip instead of 128 bits (32 bytes per sub-channel instead of 64)
- Useful when interleaving read and write commands at fine granularity, or for access patterns that don’t align to 64-byte boundaries
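- Burst chop is selected when the READ or WRITE command is issued (fixed in a mode register or chosen per command); it is not a separate command that terminates a burst already in flight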
Column-to-column timing (tCCD)
Consecutive burst commands must be spaced by at least tCCD to allow the burst buffer to reload and the data bus to settle:
| Variant | When it applies | Typical DDR5 |
|---|---|---|
| tCCD_S (short) | Two CAS commands to different bank groups | 8–12 cycles |
| tCCD_L (long) | Two CAS commands to the same bank group | 16–20 cycles |
Different bank groups have independent I/O paths, allowing shorter tCCD between them. With 8 bank groups, up to 8 burst commands can be pipelined with tCCD_S gaps — substantially higher throughput than DDR4’s 4 bank groups.
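To see how the two spacings interact, here is a deliberately simplified scheduler sketch. It models only tCCD_S and tCCD_L, issues commands in order, and ignores every other DRAM timing constraint. The default values of 8 and 16 cycles are illustrative picks from the ranges in the table, and `schedule_cas` is a hypothetical name.

```python
def schedule_cas(bank_groups, tccd_s=8, tccd_l=16):
    """Earliest issue cycle for each CAS command, given its target bank group.

    Simplified model: in-order issue, only tCCD_S / tCCD_L are enforced.
    """
    issue_cycles = []
    last_issue_by_group = {}
    last_issue_any = None
    for group in bank_groups:
        earliest = 0
        if last_issue_any is not None:
            # Any two consecutive CAS commands need at least tCCD_S.
            earliest = max(earliest, last_issue_any + tccd_s)
        if group in last_issue_by_group:
            # Commands to the same bank group need at least tCCD_L.
            earliest = max(earliest, last_issue_by_group[group] + tccd_l)
        issue_cycles.append(earliest)
        last_issue_by_group[group] = earliest
        last_issue_any = earliest
    return issue_cycles


print(schedule_cas([0, 1, 2, 3]))  # [0, 8, 16, 24]
print(schedule_cas([0, 0, 0]))     # [0, 16, 32]
```

Under these assumed values, rotating across bank groups sustains one burst every 8 cycles, while hammering a single bank group halves that rate.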
Flexibility
The burst buffer does not force sequential access. If the next request targets:
- A different 128-bit block in the same open row (row hit): the multiplexer loads the new block and a new burst can start as soon as tCCD has elapsed, with no precharge or activate needed
- A different row (row miss): full PRE + ACT + burst sequence
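The two cases can be summarised as a small decision function. This sketch is illustrative only: `commands_for_read` is a hypothetical helper, the command mnemonics follow the PRE/ACT naming used above, and the idle-bank case (no row open) is an addition not covered in the list.

```python
def commands_for_read(open_row, target_row):
    """Command sequence needed to read from target_row in one bank.

    open_row is the currently open row number, or None if the bank is idle
    (the idle case is an assumption added for completeness).
    """
    if open_row is None:
        return ["ACT", "RD"]          # nothing to close, just activate
    if open_row == target_row:
        return ["RD"]                 # row hit: new burst from the open row
    return ["PRE", "ACT", "RD"]       # row miss: close, open, then burst


print(commands_for_read(42, 42))  # ['RD']
print(commands_for_read(42, 7))   # ['PRE', 'ACT', 'RD']
```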
See also
- dram-read-write-refresh — the column-select step that the burst buffer optimises
- dram-row-hits-and-latency — row-side latency and the tCCD constraint; bank group interleaving
- dram-subarrays — the physical counterpart optimisation on the bitline side
Sources
- Branch Education — How Does Computer Memory Work?
- Wikipedia — DDR5 SDRAM
- KAD8 — DDR Memory Fundamentals: Architecture, Prefetch, and Addressing
- Micron — DDR5 SDRAM New Features White Paper
