Cache hierarchy contract


This document is the contract for the executable cache-hierarchy RTL in rtl/cache/. It complements docs/arch/cpu-subsystem.md, docs/arch/memory-subsystem.md, and docs/arch/interconnect.md. The benchmarking and BLOCKED-claim contract for this work lives at docs/evidence/cache/cache-evidence-gate.yaml and is enforced by scripts/check_cache_hierarchy.py.

The cache hierarchy is the on-die SRAM that hides DRAM latency. Without this RTL the SoC has one tiny SRAM behind AXI-Lite; with this RTL the SoC has a three-level hierarchy (split L1I and L1D, private L2, shared L3) plus a multi-bank SLC and a BDI compression path, all sized to meet or exceed the 2028 phone-class minimums.

Geometry

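| Level | Size | Ways | Line | Sets / bank | Banks | Latency (cyc) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| L1I | 64 KB | 8 | 64 B | 128 | 1 | 4 (load-use) | Parity per line, FDIP prefetch |
| L1D | 64 KB | 8 | 64 B | 128 | 8 | 4 (load-use) | SECDED ECC, 2R/2W banked |
| L2 | 1 MB | 8 | 64 B | 2048 | 1 | 12 | MESI, inclusive of L1I tags, PTW data port |
| L3 | 8 MB | 16 | 64 B | 8192 | 4 | ~25 | MESI directory, DRRIP/Hawkeye/Mockingjay |
| SLC | 16 MB | 16 | 64 B | 16384 | 4 | ~50 | Per-client QoS, way-partition, BDI compression |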

Each size and bank count is a module parameter (SIZE_BYTES, WAYS, LINE_BYTES, BANKS). Halving the L2 to 512 KB or shrinking the SLC to 8 MB for a smaller variant is one parameter override.
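
The relationship between these parameters and the set counts in the table can be checked with a few lines of Python. This is a minimal sketch, not the RTL: the `CacheGeometry` class and its field names are hypothetical stand-ins for SIZE_BYTES, WAYS, LINE_BYTES, and BANKS, and it assumes the usual rule that total sets equal capacity divided by ways times line size.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical mirror of the module parameters named in this section."""
    size_bytes: int
    ways: int
    line_bytes: int
    banks: int

    def total_sets(self) -> int:
        # sets = capacity / (associativity * line size)
        sets, rem = divmod(self.size_bytes, self.ways * self.line_bytes)
        if rem != 0:
            raise ValueError("size_bytes must be a multiple of ways * line_bytes")
        return sets


KB, MB = 1024, 1024 * 1024
L2 = CacheGeometry(size_bytes=1 * MB, ways=8, line_bytes=64, banks=1)
SLC = CacheGeometry(size_bytes=16 * MB, ways=16, line_bytes=64, banks=4)

# A smaller variant changes exactly one field per cache.
L2_SMALL = replace(L2, size_bytes=512 * KB)
SLC_SMALL = replace(SLC, size_bytes=8 * MB)

print(L2.total_sets(), L2_SMALL.total_sets())    # 2048 1024
print(SLC.total_sets(), SLC_SMALL.total_sets())  # 16384 8192
```

The default values reproduce the 2048 and 16384 entries in the geometry table; the overrides halve them without touching ways or line size.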

2028 phone-class minimums enforced by the claim gate:

  • L1I ≥ 32 KB
  • L1D ≥ 32 KB
  • L2 ≥ 256 KB
  • L3 ≥ 4 MB
  • SLC ≥ 8 MB
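
As an illustration of what enforcing these minimums amounts to, the sketch below compares a configuration against the list above. It is not the contents of scripts/check_cache_hierarchy.py; the function name and the dictionary layout are assumptions made for the example.

```python
KB, MB = 1024, 1024 * 1024

# Minimums copied from the list above.
MIN_BYTES = {
    "L1I": 32 * KB,
    "L1D": 32 * KB,
    "L2": 256 * KB,
    "L3": 4 * MB,
    "SLC": 8 * MB,
}


def below_minimum(sizes: dict) -> list:
    """Return the levels that violate a minimum; a missing level counts as a violation."""
    return [lvl for lvl, floor in MIN_BYTES.items() if sizes.get(lvl, 0) < floor]


default = {"L1I": 64 * KB, "L1D": 64 * KB, "L2": 1 * MB, "L3": 8 * MB, "SLC": 16 * MB}
small = dict(default, L2=512 * KB, SLC=8 * MB)   # the smaller variant from the Geometry section
too_small = dict(default, L3=2 * MB)             # hypothetical out-of-contract variant

print(below_minimum(default))    # []
print(below_minimum(small))      # []
print(below_minimum(too_small))  # ['L3']
```

Treating a missing level as a violation keeps the check fail-closed, in the same spirit as the claim gate described at the end of this document.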

Stretch targets (Apple-class) are not gated:

  • L1I/L1D 96 KB
  • L2 2 MB on the Ultra big core
  • L3 16 MB
  • SLC 32 MB

Files

rtl/cache/
  cache_pkg.sv                 shared parameters and helpers
  ftq_to_l1i_pkg.sv            BPU FTQ -> L1I prefetch interface
  lsu_to_l1d_pkg.sv            OoO LSU -> L1D 2R/2W interface
  l1i/e1_l1i_cache.sv          read-only L1I with FDIP prefetch
  l1d/e1_l1d_cache.sv          2R/2W L1D with SECDED + MESI
  l2/e1_l2_cache.sv            private L2 with PTW port
  l3/e1_l3_cache.sv            shared L3 with directory + DRRIP
  slc/e1_slc.sv                SLC with QoS + BDI + way partition
  prefetch/e1_berti_prefetcher.sv
  prefetch/e1_fdip_l1i_prefetcher.sv
  prefetch/e1_stride_prefetcher.sv
  prefetch/e1_best_offset_prefetcher.sv
  prefetch/e1_spp_prefetcher.sv
  prefetch/e1_ipcp_prefetcher.sv
  prefetch/e1_pythia_stub.sv          BLOCKED stub; real RTL is follow-on
  replacement/e1_drrip.sv             cheap MVP
  replacement/e1_hawkeye.sv           fallback option
  replacement/e1_mockingjay.sv        primary academic-quality port
  compression/e1_bdi_compress.sv
  compression/e1_bdi_decompress.sv
  coherence/tl_c_to_chi_bridge.sv     TL-C plane -> AXI4/CHI south boundary

Coordination interfaces

BPU ↔ L1I

The BPU runs a decoupled Fetch Target Queue (FTQ) ahead of the IFU. The FTQ writes prefetch requests; the L1I consumes them.

ftq_to_l1i_pkg::ftq_prefetch_req_t = {
  paddr_line[39:0],      // 64 B-aligned
  confidence[2:0],       // 0..7
  branch_target          // 1 if FTQ entry originates from a branch target
}

The handshake is single-cycle. On ifu_flush the L1I drops in-flight prefetches, but it does not abort demand line fills that are already in progress. The BPU agent owns the FTQ producer side and never modifies the L1I; the cache agent owns the consumer side and never modifies the FTQ. Both sides import the same package.
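
The field constraints and the flush rule can be modeled in a few lines of Python. This is a toy behavioral sketch under stated assumptions: `FtqPrefetchReq` and `L1IFillModel` are hypothetical names, paddr_line is assumed to be a full byte address whose low six bits are zero, and the lists are not how the RTL tracks fills.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FtqPrefetchReq:
    paddr_line: int       # 40-bit physical address, assumed to be a 64 B-aligned byte address
    confidence: int       # 0..7
    branch_target: bool   # True if the FTQ entry originates from a branch target

    def __post_init__(self) -> None:
        if not 0 <= self.paddr_line < (1 << 40):
            raise ValueError("paddr_line must fit in 40 bits")
        if self.paddr_line % 64 != 0:
            raise ValueError("paddr_line must be 64 B-aligned")
        if not 0 <= self.confidence <= 7:
            raise ValueError("confidence must be in 0..7")


class L1IFillModel:
    """Toy model of in-flight fills; not the RTL's actual bookkeeping."""

    def __init__(self) -> None:
        self.prefetches = []     # in-flight FtqPrefetchReq entries
        self.demand_fills = []   # line addresses of in-progress demand fills

    def ifu_flush(self) -> None:
        # Prefetches are speculative and are dropped; demand fills run to completion.
        self.prefetches.clear()


model = L1IFillModel()
model.prefetches.append(FtqPrefetchReq(paddr_line=0x8000_0040, confidence=5, branch_target=True))
model.demand_fills.append(0x8000_0000)
model.ifu_flush()
print(len(model.prefetches), len(model.demand_fills))  # 0 1
```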

LSU ↔ L1D

lsu_to_l1d_pkg::lsu_l1d_req_t = {
  paddr[39:0], size[2:0], is_load, wdata[127:0], wstrb[15:0], tag[7:0]
}
lsu_to_l1d_pkg::lsu_l1d_resp_t = {
  rdata[127:0], tag[7:0], ack, replay, ecc_uncorrectable
}

The L1D exposes two request ports (p0, p1). When both ports target the same bank in the same cycle, that is, when their paddr[6:4] values match, p1 replays. ECC double-bit errors surface as ecc_uncorrectable=1 plus a replay; single-bit errors are corrected silently with an hpm_l1d_ecc_corr pulse.
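
The bank-conflict rule follows directly from the address bits. The sketch below assumes both ports carry a valid request in the same cycle; the function names are hypothetical.

```python
def bank_of(paddr: int) -> int:
    """Bank index taken from paddr[6:4]: eight banks, interleaved every 16 bytes."""
    return (paddr >> 4) & 0x7


def p1_must_replay(p0_paddr: int, p1_paddr: int) -> bool:
    """True when p0 and p1 select the same bank, in which case p1 replays."""
    return bank_of(p0_paddr) == bank_of(p1_paddr)


print(p1_must_replay(0x1000, 0x1010))  # False: banks 0 and 1
print(p1_must_replay(0x1000, 0x1080))  # True: both map to bank 0
print(p1_must_replay(0x1000, 0x1008))  # True: same 16-byte chunk, bank 0
```

Because the index uses bits [6:4], addresses 128 bytes apart always collide, while neighbouring 16-byte chunks never do.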

L2 ↔ L3 ↔ SLC ↔ DRAM

TileLink TL-C-class messages flow inside the cluster (L1↔L2↔L3↔SLC). At the SLC↔DRAM boundary, tl_c_to_chi_bridge converts TL acquire/release to AXI4 AR/AW/W/R/B with 8-byte beats. The memory agent owns everything south of that bridge.
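
For orientation, the arithmetic behind 8-byte beats is sketched below. It assumes the bridge moves one 64 B line as a single AXI4 INCR burst, which this document does not state; the AxLEN and AxSIZE encodings (beats minus one, and log2 of bytes per beat) are the standard AXI4 ones. The function name is hypothetical.

```python
def axi4_burst_fields(line_bytes: int = 64, beat_bytes: int = 8) -> dict:
    """Burst fields for one cache line, assuming a single INCR burst per line."""
    if line_bytes % beat_bytes != 0:
        raise ValueError("line must be a whole number of beats")
    if (beat_bytes & (beat_bytes - 1)) != 0:
        raise ValueError("beat size must be a power of two")
    beats = line_bytes // beat_bytes
    return {
        "beats": beats,
        "axlen": beats - 1,                     # AXI4 encodes burst length minus one
        "axsize": beat_bytes.bit_length() - 1,  # log2(bytes per beat)
    }


print(axi4_burst_fields())  # {'beats': 8, 'axlen': 7, 'axsize': 3}
```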

Coherence

  • Protocol: MESI on TileLink TL-C inside the cluster.
  • Directory: distributed at L3, snoop filter per slice, full inclusion of L2 tags.
  • Probes: MESI_S (downgrade M→S, write back dirty) and MESI_I (invalidate, write back dirty if M).
  • L1I never holds dirty data; on MESI_I probes it invalidates and acks without writeback.
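
The probe rules above reduce to a small state function. This sketch covers only the responder side of a probe; the E→S downgrade without writeback is an assumption (the list above only spells out the M cases), and the type and function names are hypothetical.

```python
from enum import Enum


class State(Enum):
    M = "M"
    E = "E"
    S = "S"
    I = "I"


class Probe(Enum):
    MESI_S = "downgrade"
    MESI_I = "invalidate"


def respond(state: State, probe: Probe) -> tuple:
    """Return (next_state, writeback) for a cache that receives a probe."""
    writeback = state is State.M           # only a modified line carries dirty data
    if probe is Probe.MESI_I:
        return State.I, writeback          # invalidate; write back if dirty
    if state in (State.M, State.E):
        return State.S, writeback          # downgrade; E -> S is assumed to be silent
    return state, False                    # S and I are unchanged by a downgrade probe


print(respond(State.M, Probe.MESI_S))  # (<State.S: 'S'>, True)
print(respond(State.S, Probe.MESI_I))  # (<State.I: 'I'>, False)
```

An L1I line is never in M, so under this function it always acks without a writeback, which matches the last bullet.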

QoS at SLC

e1_cache_pkg::qos_class_e defines eight classes; lower numeric value wins. The SLC arbiter guarantees that under saturation it services at least one QOS_DISPLAY_RT request every display_window_cycles. Way allocation per QoS class is programmable via way_alloc_mask. Way shutoff for DVFS is programmable per bank via way_enable_mask.

| Class | Numeric | Allowed clients |
| --- | --- | --- |
| QOS_DISPLAY_RT | 0 | Display real-time |
| QOS_CAMERA_ISP | 1 | Camera, ISP |
| QOS_CPU_FG | 2 | CPU foreground threads |
| QOS_CPU_BG | 3 | CPU background, writebacks |
| QOS_NPU | 4 | NPU tensor streaming |
| QOS_GPU | 5 | GPU / 2D rasterizer |
| QOS_DMA_BULK | 6 | Peripheral DMA, USB, NVMe |
| QOS_LOW | 7 | Background / non-time-sensitive |
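
A minimal Python sketch of the two mechanisms described in this section: strict priority by numeric value, and way selection through masks. The enum mirrors the table above. Combining way_alloc_mask and way_enable_mask with a bitwise AND is an assumption about how the two interact, and the sketch does not model the display_window_cycles guarantee.

```python
from enum import IntEnum
from typing import List, Optional


class QosClass(IntEnum):
    QOS_DISPLAY_RT = 0
    QOS_CAMERA_ISP = 1
    QOS_CPU_FG = 2
    QOS_CPU_BG = 3
    QOS_NPU = 4
    QOS_GPU = 5
    QOS_DMA_BULK = 6
    QOS_LOW = 7


def pick_winner(pending: List[QosClass]) -> Optional[QosClass]:
    """Lower numeric value wins; None when nothing is pending."""
    return min(pending, default=None)


def candidate_ways(way_alloc_mask: int, way_enable_mask: int, ways: int = 16) -> List[int]:
    """Ways a class may allocate into: allocated to the class AND not shut off (assumed)."""
    usable = way_alloc_mask & way_enable_mask
    return [w for w in range(ways) if (usable >> w) & 1]


print(pick_winner([QosClass.QOS_GPU, QosClass.QOS_DISPLAY_RT, QosClass.QOS_NPU]).name)  # QOS_DISPLAY_RT
print(candidate_ways(0x00F0, 0x0FFF))  # [4, 5, 6, 7]
print(candidate_ways(0xF000, 0x0FFF))  # []
```

The last call shows a case worth guarding against in configuration: a class whose allocated ways have all been shut off for DVFS has nowhere to allocate.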

BDI compression at SLC

Five line encodings are supported, four compressed forms plus the uncompressed fallback (BDI; Pekhimenko et al., PACT'12):

| Form | Encoding | Payload | Bytes vs 64 B line |
| --- | --- | --- | --- |
| BDI_ZERO | all-zero line | none | 0 |
| BDI_REPEAT | 8 B base repeated | 8 B | 8 |
| BDI_B8D1 | 8 B base + 8 × 1 B signed delta | 16 B | 16 |
| BDI_B8D2 | 8 B base + 8 × 2 B signed delta | 24 B | 24 |
| BDI_NONE | uncompressed | 64 B | 64 |

Only the SLC compresses lines. L1 and L2 do not, because compression would add a latency tax on their hit paths.
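
To make the encodings in the table concrete, here is a sketch of a classifier that picks the smallest form for a 64 B line. It is not the logic of e1_bdi_compress.sv: it assumes little-endian 8-byte words, takes the first word as the base, and computes deltas as signed 64-bit differences with wraparound.

```python
import struct


def _signed64(x: int) -> int:
    """Interpret the low 64 bits of x as a two's-complement value."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x


def classify(line: bytes) -> tuple:
    """Return (form, payload_bytes) for a 64-byte line, trying the smallest form first."""
    if len(line) != 64:
        raise ValueError("expected a 64-byte line")
    words = struct.unpack("<8Q", line)   # eight little-endian 64-bit words (assumed layout)
    base = words[0]                      # assumed: the first word is the base
    if all(w == 0 for w in words):
        return "BDI_ZERO", 0
    if all(w == base for w in words):
        return "BDI_REPEAT", 8
    deltas = [_signed64(w - base) for w in words]
    if all(-128 <= d <= 127 for d in deltas):
        return "BDI_B8D1", 16            # 8 B base + 8 x 1 B deltas
    if all(-32768 <= d <= 32767 for d in deltas):
        return "BDI_B8D2", 24            # 8 B base + 8 x 2 B deltas
    return "BDI_NONE", 64


pointers = struct.pack("<8Q", *[0x1000 + 8 * i for i in range(8)])
print(classify(bytes(64)))   # ('BDI_ZERO', 0)
print(classify(pointers))    # ('BDI_B8D1', 16)
```

Arrays of nearby pointers or small counters, as in the second call, are the typical case where a single base plus narrow deltas pays off.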

Replacement

DRRIP is the default for L3 and SLC. The L3 module parameter REPLACEMENT_POLICY selects DRRIP/Hawkeye/Mockingjay/LRU. Mockingjay is the primary academic-quality port, validated functionally against a tiny Belady oracle in the cocotb harness, but its productized form requires follow-on work; see docs/evidence/cache/cache-evidence-gate.yaml.

HPM events

The cache hierarchy emits 1-cycle pulses on hpm_* signals. The CPU's HPM aggregator owns the counter registers. Event codes are declared in e1_cache_pkg::HPM_* and reserved for the cache hierarchy at the Zihpm-class boundary.

| Code | Event |
| --- | --- |
| 0 | L1I access |
| 1 | L1I miss |
| 2 | L1I useful prefetch |
| 3 | L1D access |
| 4 | L1D miss |
| 5 | L1D useful prefetch |
| 6 | L1D ECC single-bit corrected |
| 7 | L1D ECC double-bit uncorrectable |
| 8 | L2 access |
| 9 | L2 miss |
| 10 | L2 prefetch |
| 11 | L3 access |
| 12 | L3 miss |
| 13 | L3 snoop hit (probe forwarded) |
| 14 | L3 writeback |
| 15 | SLC access |
| 16 | SLC miss |
| 17 | SLC way shutoff active |
| 18 | SLC BDI compression hit |
| 19 | SLC display realtime hold |
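
The split between pulse producers and counter owner can be illustrated with a small aggregator model. Only the numeric codes come from the table above; the enum member names are hypothetical (the real identifiers live in e1_cache_pkg::HPM_*), only a subset is listed, and each pulse is assumed to be a single-bit signal that counts at most once per cycle.

```python
from collections import Counter
from enum import IntEnum


class HpmEvent(IntEnum):
    # Codes from the table above; names are illustrative only.
    L1I_ACCESS = 0
    L1I_MISS = 1
    L1D_ACCESS = 3
    L1D_MISS = 4
    L1D_ECC_CORRECTED = 6
    SLC_ACCESS = 15
    SLC_MISS = 16


def accumulate(pulses_per_cycle) -> Counter:
    """Count one-cycle pulses; each cycle is given as the set of events that fired."""
    counts = Counter()
    for pulses in pulses_per_cycle:
        counts.update(set(pulses))
    return counts


trace = [
    {HpmEvent.L1D_ACCESS},
    {HpmEvent.L1D_ACCESS, HpmEvent.L1D_MISS},
    {HpmEvent.L1D_ACCESS},
    set(),
]
counts = accumulate(trace)
miss_rate = counts[HpmEvent.L1D_MISS] / counts[HpmEvent.L1D_ACCESS]
print(counts[HpmEvent.L1D_ACCESS], counts[HpmEvent.L1D_MISS])  # 3 1
```

Derived metrics such as miss rate are computed by software from pairs of counters; the hierarchy itself only emits the pulses.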

Verification

| Target | Coverage |
| --- | --- |
| make rtl-check | Verilator lint of every cache module |
| make cocotb-cache-coherence | MESI transitions, single-writer-multi-reader |
| make champsim-prefetch-sweep | DPC-3 sweep of upstream ChampSim prefetchers (no/next_line/ip_stride/spp_dev/va_ampm_lite). Berti/IPCP/Bingo/BOP/Pythia remain BLOCKED until ported. evidence_class=champsim_dpc3_traces_only |
| make mockingjay-vs-lru-sweep | DPC-3 LRU baseline + bundled replacement deltas (drrip/ship/srrip). Hawkeye/Mockingjay-prod remain BLOCKED until ported. evidence_class=champsim_dpc3_traces_only |
| make cocotb-cache-mockingjay-accuracy | Mockingjay-prod RTL hit-rate vs LRU on synthetic scan+reuse stream (currently below +10% threshold; see mockingjay_cocotb_synthetic_report.json) |
| make lmbench-cache-curve | lat_mem_rd canonical L1/L2/L3/SLC/DRAM curve (functional in sim; real-target evidence BLOCKED) |
| make cache-hierarchy-claim-gate | Static gate enforcing 2028 minimums and blocked-claim discipline |

Claim boundary

This RTL is a synthesizable, Verilator-runnable cache hierarchy for pre-silicon evaluation, simulation, and ChampSim/gem5 cross-checking. It is not silicon, it is not measured on a real target, and it is not evidence of phone-class IPC or latency. Phone-class claims remain BLOCKED until the gate at docs/evidence/cache/cache-evidence-gate.yaml records measured evidence from real silicon or a full-system simulator with traceable provenance.

make cache-hierarchy-claim-gate enforces this boundary. Any addition of a phone-class claim must replace the corresponding BLOCKED entry with a measured evidence artifact; the gate fails closed otherwise.
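
The fail-closed rule can be sketched as follows. The schema here is entirely hypothetical: the `status` and `evidence` field names, the MEASURED value, the claim names, and the artifact path are placeholders for illustration and are not taken from cache-evidence-gate.yaml or check_cache_hierarchy.py.

```python
def gate_failures(claims: dict) -> list:
    """Return the claims that fail the gate; anything not explicitly allowed fails."""
    failures = []
    for name, entry in claims.items():
        status = entry.get("status")
        if status == "BLOCKED":
            continue                                    # claim is withheld: allowed
        if status == "MEASURED" and entry.get("evidence"):
            continue                                    # claim is backed by an artifact: allowed
        failures.append(name)                           # unknown or unbacked: fail closed
    return failures


claims = {
    "phone_class_ipc": {"status": "BLOCKED"},
    "l3_latency": {"status": "MEASURED", "evidence": "reports/example.json"},  # placeholder path
    "slc_bandwidth": {"status": "MEASURED"},            # no artifact attached
    "dram_latency": {},                                  # no status at all
}
print(gate_failures(claims))  # ['slc_bandwidth', 'dram_latency']
```

The important property is the default branch: an entry passes only by matching one of the allowed shapes, so a typo or a missing field fails rather than slipping through.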