packages/chip/docs/arch/cache-hierarchy.md
This document is the contract for the executable cache-hierarchy RTL in
rtl/cache/. It complements docs/arch/cpu-subsystem.md,
docs/arch/memory-subsystem.md, and docs/arch/interconnect.md. The
benchmarking and BLOCKED-claim contract for this work lives at
docs/evidence/cache/cache-evidence-gate.yaml and is enforced by
scripts/check_cache_hierarchy.py.
The cache hierarchy is the on-die SRAM that hides DRAM latency. Without this RTL the SoC has one tiny SRAM behind AXI-Lite; with this RTL the SoC has a four-level hierarchy (L1I, L1D, private L2, shared L3) plus a multi-bank SLC and a BDI compression path, all sized to the 2028 phone-class minimums.
| Level | Size | Ways | Line | Sets (total) | Banks | Latency (cyc) | Notes |
|---|---|---|---|---|---|---|---|
| L1I | 64 KB | 8 | 64 B | 128 | 1 | 4 (load-use) | Parity per line, FDIP prefetch |
| L1D | 64 KB | 8 | 64 B | 128 | 8 | 4 (load-use) | SECDED ECC, 2R/2W banked |
| L2 | 1 MB | 8 | 64 B | 2048 | 1 | 12 | MESI, inclusive of L1I tags, PTW data port |
| L3 | 8 MB | 16 | 64 B | 8192 | 4 | ~25 | MESI directory, DRRIP/Hawkeye/Mockingjay |
| SLC | 16 MB | 16 | 64 B | 16384 | 4 | ~50 | Per-client QoS, way-partition, BDI compression |
Each size and bank count is a module parameter (SIZE_BYTES, WAYS,
LINE_BYTES, BANKS). Halving the L2 to 512 KB or shrinking the SLC to
8 MB for a smaller variant is one parameter override.
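The set counts in the table follow from the other three parameters. The sketch below is a minimal Python model, not the RTL: the `CacheGeometry` class and the `total_sets` name are hypothetical, and only the parameter names and table values come from this document.

```python
from dataclasses import dataclass, replace

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical mirror of the SIZE_BYTES / WAYS / LINE_BYTES / BANKS parameters."""
    size_bytes: int
    ways: int
    line_bytes: int
    banks: int

    @property
    def total_sets(self) -> int:
        # sets = capacity / (associativity * line size); must divide evenly.
        sets, remainder = divmod(self.size_bytes, self.ways * self.line_bytes)
        if remainder:
            raise ValueError("size is not a multiple of ways * line_bytes")
        return sets


# Values copied from the table above.
HIERARCHY = {
    "L1I": CacheGeometry(64 * KB, 8, 64, 1),
    "L1D": CacheGeometry(64 * KB, 8, 64, 8),
    "L2": CacheGeometry(1 * MB, 8, 64, 1),
    "L3": CacheGeometry(8 * MB, 16, 64, 4),
    "SLC": CacheGeometry(16 * MB, 16, 64, 4),
}

# A smaller variant is a single parameter override.
L2_SMALL = replace(HIERARCHY["L2"], size_bytes=512 * KB)
SLC_SMALL = replace(HIERARCHY["SLC"], size_bytes=8 * MB)
```

Halving `size_bytes` halves `total_sets` while ways and line size stay fixed, which is why the smaller variants need no other change in this model.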
2028 phone-class minimums enforced by the claim gate:
Stretch targets (Apple-class) are not gated:

    rtl/cache/
      cache_pkg.sv                            shared parameters and helpers
      ftq_to_l1i_pkg.sv                       BPU FTQ -> L1I prefetch interface
      lsu_to_l1d_pkg.sv                       OoO LSU -> L1D 2R/2W interface
      l1i/e1_l1i_cache.sv                     read-only L1I with FDIP prefetch
      l1d/e1_l1d_cache.sv                     2R/2W L1D with SECDED + MESI
      l2/e1_l2_cache.sv                       private L2 with PTW port
      l3/e1_l3_cache.sv                       shared L3 with directory + DRRIP
      slc/e1_slc.sv                           SLC with QoS + BDI + way partition
      prefetch/e1_berti_prefetcher.sv
      prefetch/e1_fdip_l1i_prefetcher.sv
      prefetch/e1_stride_prefetcher.sv
      prefetch/e1_best_offset_prefetcher.sv
      prefetch/e1_spp_prefetcher.sv
      prefetch/e1_ipcp_prefetcher.sv
      prefetch/e1_pythia_stub.sv              BLOCKED stub; real RTL is follow-on
      replacement/e1_drrip.sv                 cheap MVP
      replacement/e1_hawkeye.sv               fallback option
      replacement/e1_mockingjay.sv            primary academic-quality port
      compression/e1_bdi_compress.sv
      compression/e1_bdi_decompress.sv
      coherence/tl_c_to_chi_bridge.sv         TL-C plane -> AXI4/CHI south boundary

The BPU runs a decoupled Fetch Target Queue (FTQ) ahead of the IFU. The FTQ writes prefetch requests; the L1I consumes them.

    ftq_to_l1i_pkg::ftq_prefetch_req_t = {
      paddr_line[39:0],   // 64 B-aligned
      confidence[2:0],    // 0..7
      branch_target       // 1 if FTQ entry originates from a branch target
    }

The handshake is single-cycle. The L1I drops in-flight prefetches on ifu_flush.
In-progress demand line fills are not aborted by flush. The BPU agent owns
the FTQ producer side and never modifies the L1I; the cache agent owns the
consumer side and never modifies the FTQ. Both sides import the same
package.
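The flush rule can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the RTL: `FtqPrefetchReq` mirrors the fields of ftq_prefetch_req_t, while `L1IPrefetchModel` and its method names are hypothetical and exist only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FtqPrefetchReq:
    """Python mirror of ftq_prefetch_req_t (field names from the package)."""
    paddr_line: int      # 40-bit physical address, 64 B-aligned
    confidence: int      # 0..7
    branch_target: bool  # True if the FTQ entry originates from a branch target


class L1IPrefetchModel:
    """Hypothetical consumer-side model: prefetches are flushable, demand fills are not."""

    def __init__(self) -> None:
        self.inflight_prefetches = deque()
        self.demand_fills = set()

    def accept_prefetch(self, req: FtqPrefetchReq) -> None:
        if req.paddr_line % 64 or not 0 <= req.paddr_line < (1 << 40):
            raise ValueError("paddr_line must be a 64 B-aligned 40-bit address")
        if not 0 <= req.confidence <= 7:
            raise ValueError("confidence must be in 0..7")
        self.inflight_prefetches.append(req)

    def start_demand_fill(self, paddr_line: int) -> None:
        self.demand_fills.add(paddr_line)

    def ifu_flush(self) -> int:
        """Drop every in-flight prefetch; leave demand fills untouched."""
        dropped = len(self.inflight_prefetches)
        self.inflight_prefetches.clear()
        return dropped
```

After a flush, the model reports how many prefetches were dropped and still holds every demand fill that was in progress.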

    lsu_to_l1d_pkg::lsu_l1d_req_t = {
      paddr[39:0], size[2:0], is_load, wdata[127:0], wstrb[15:0], tag[7:0]
    }
    lsu_to_l1d_pkg::lsu_l1d_resp_t = {
      rdata[127:0], tag[7:0], ack, replay, ecc_uncorrectable
    }

The L1D exposes two request ports (p0, p1). A bank conflict, where both
ports target the same paddr[6:4], causes p1 to replay. ECC double-bit
errors surface as ecc_uncorrectable=1 plus a replay; single-bit errors are
corrected silently and raise an hpm_l1d_ecc_corr pulse.
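The bank-conflict rule reduces to a comparison of three address bits. The sketch below is a minimal Python model; the function names `l1d_bank` and `arbitrate` are hypothetical, and the model ignores ECC and every other replay cause.

```python
from typing import Optional, Tuple

L1D_BANKS = 8  # from the hierarchy table


def l1d_bank(paddr: int) -> int:
    """Bank index taken from paddr[6:4], i.e. one bank per 16-byte granule."""
    return (paddr >> 4) & (L1D_BANKS - 1)


def arbitrate(p0_paddr: Optional[int], p1_paddr: Optional[int]) -> Tuple[bool, bool]:
    """Return (p0_granted, p1_replay) for one cycle; None means the port is idle."""
    p0_granted = p0_paddr is not None
    p1_replay = (
        p0_paddr is not None
        and p1_paddr is not None
        and l1d_bank(p0_paddr) == l1d_bank(p1_paddr)
    )
    return p0_granted, p1_replay
```

For example, 0x1230 and 0x2230 both map to bank 3, so p1 replays; 0x1230 and 0x1240 map to banks 3 and 4 and proceed together.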
TileLink TL-C-class messages flow inside the cluster (L1↔L2↔L3↔SLC). At
the SLC↔DRAM boundary, tl_c_to_chi_bridge converts TL acquire/release to
AXI4 AR/AW/W/R/B with 8-byte beats. The memory agent owns everything south
of that bridge.
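As a worked example of the beat arithmetic, the sketch below derives the AXI4 burst fields for one 64 B line moved in 8-byte beats. It assumes the bridge issues a whole line as a single burst, which this document does not state; the function name is hypothetical. The AxLEN and AxSIZE encodings are the standard AXI4 ones (beats minus one, and log2 of bytes per beat).

```python
def axi_burst_fields(line_bytes: int = 64, beat_bytes: int = 8) -> dict:
    """Burst fields for moving one cache line, assuming one burst per line."""
    if line_bytes % beat_bytes:
        raise ValueError("line size must be a multiple of the beat size")
    if beat_bytes & (beat_bytes - 1):
        raise ValueError("beat size must be a power of two")
    beats = line_bytes // beat_bytes
    return {
        "beats": beats,                          # data transfers per line
        "axlen": beats - 1,                      # AXI4 AxLEN encodes beats - 1
        "axsize": beat_bytes.bit_length() - 1,   # AXI4 AxSIZE encodes log2(bytes per beat)
    }
```

With the defaults this yields 8 beats per line, AxLEN = 7, and AxSIZE = 3.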
Probes carry one of two target states: MESI_S (downgrade M→S, write back
dirty) and MESI_I (invalidate, write back dirty if M). On MESI_I probes, a
cache that holds no dirty copy of the line invalidates and acks without
writeback.

e1_cache_pkg::qos_class_e defines eight classes; lower numeric value
wins. The SLC arbiter guarantees that under saturation it services at
least one QOS_DISPLAY_RT request every display_window_cycles. Way
allocation per QoS class is programmable via way_alloc_mask. Way
shutoff for DVFS is programmable per bank via way_enable_mask.
| Class | Numeric | Allowed clients |
|---|---|---|
| QOS_DISPLAY_RT | 0 | Display real-time |
| QOS_CAMERA_ISP | 1 | Camera, ISP |
| QOS_CPU_FG | 2 | CPU foreground threads |
| QOS_CPU_BG | 3 | CPU background, writebacks |
| QOS_NPU | 4 | NPU tensor streaming |
| QOS_GPU | 5 | GPU / 2D rasterizer |
| QOS_DMA_BULK | 6 | Peripheral DMA, USB, NVMe |
| QOS_LOW | 7 | Background / non-time-sensitive |
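The priority rule and the two way masks can be sketched in a few lines of Python. The class names and numeric values come from the table; everything else is an assumption made for illustration: the function names are hypothetical, way_alloc_mask is modeled as one 16-bit mask per class, way_enable_mask as one 16-bit mask for the bank, and the two are combined with a bitwise AND.

```python
from enum import IntEnum
from typing import Dict, List, Optional


class QosClass(IntEnum):
    QOS_DISPLAY_RT = 0
    QOS_CAMERA_ISP = 1
    QOS_CPU_FG = 2
    QOS_CPU_BG = 3
    QOS_NPU = 4
    QOS_GPU = 5
    QOS_DMA_BULK = 6
    QOS_LOW = 7


def select_request(pending: List[QosClass]) -> Optional[QosClass]:
    """Strict priority: the lowest numeric class among pending requests wins."""
    return min(pending) if pending else None


def candidate_ways(
    qos: QosClass,
    way_alloc_mask: Dict[QosClass, int],  # assumed layout: one bit per way, per class
    way_enable_mask: int,                 # assumed layout: one bit per way, for this bank
    ways: int = 16,
) -> List[int]:
    """Ways a class may allocate into: allowed for the class and not shut off."""
    mask = way_alloc_mask[qos] & way_enable_mask
    return [w for w in range(ways) if (mask >> w) & 1]
```

For instance, a GPU allocation mask of 0x00F0 combined with an enable mask of 0xFF3F leaves ways 4 and 5 in this model, because ways 6 and 7 are shut off.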
Five line encodings are supported, including the uncompressed fallback (BDI; Pekhimenko et al., PACT'12):
| Form | Encoding | Payload | Bytes vs 64 B line |
|---|---|---|---|
| BDI_ZERO | all-zero line | none | 0 |
| BDI_REPEAT | 8 B base repeated | 8 B | 8 |
| BDI_B8D1 | 8 B base + 8 × 1 B signed delta | 16 B | 16 |
| BDI_B8D2 | 8 B base + 8 × 2 B signed delta | 24 B | 24 |
| BDI_NONE | uncompressed | 64 B | 64 |
Only the SLC compresses lines. L1 and L2 do not, to avoid the latency tax.
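The table can be read as a classification order from the smallest form to the largest. The sketch below is a minimal Python classifier, not a port of e1_bdi_compress.sv. It assumes little-endian 8-byte words, the first word as the base, and deltas taken modulo 2^64 and interpreted as signed; none of these choices is confirmed by this document. The function name `bdi_classify` is hypothetical.

```python
import struct
from typing import Tuple

LINE_BYTES = 64
MASK64 = (1 << 64) - 1


def _signed_delta(word: int, base: int) -> int:
    """Difference word - base, wrapped to 64 bits and read as two's complement."""
    d = (word - base) & MASK64
    return d - (1 << 64) if d >= (1 << 63) else d


def bdi_classify(line: bytes) -> Tuple[str, int]:
    """Return (form, stored_bytes) for one 64 B line, trying the smallest form first."""
    if len(line) != LINE_BYTES:
        raise ValueError("expected a 64-byte line")
    words = struct.unpack("<8Q", line)  # assumption: eight little-endian 8 B words
    if all(w == 0 for w in words):
        return "BDI_ZERO", 0
    base = words[0]                     # assumption: first word is the base
    if all(w == base for w in words):
        return "BDI_REPEAT", 8
    deltas = [_signed_delta(w, base) for w in words]
    if all(-128 <= d <= 127 for d in deltas):
        return "BDI_B8D1", 8 + 8 * 1
    if all(-32768 <= d <= 32767 for d in deltas):
        return "BDI_B8D2", 8 + 8 * 2
    return "BDI_NONE", LINE_BYTES
```

A line of eight nearby pointers, such as consecutive addresses differing by a few bytes, lands in BDI_B8D1 and occupies 16 B instead of 64 B.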
DRRIP is the default for L3 and SLC. The L3 module parameter
REPLACEMENT_POLICY selects DRRIP/Hawkeye/Mockingjay/LRU. Mockingjay is
the primary academic-quality port, validated functionally against a tiny
Belady oracle in the cocotb harness, but its productized form requires
follow-on work; see docs/evidence/cache/cache-evidence-gate.yaml.
The cache hierarchy emits 1-cycle pulses on hpm_* signals. The CPU's
HPM aggregator owns the counter registers. Event codes are declared in
e1_cache_pkg::HPM_* and reserved for the cache hierarchy at the
Zihpm-class boundary.
| Code | Event |
|---|---|
| 0 | L1I access |
| 1 | L1I miss |
| 2 | L1I useful prefetch |
| 3 | L1D access |
| 4 | L1D miss |
| 5 | L1D useful prefetch |
| 6 | L1D ECC single-bit corrected |
| 7 | L1D ECC double-bit uncorrectable |
| 8 | L2 access |
| 9 | L2 miss |
| 10 | L2 prefetch |
| 11 | L3 access |
| 12 | L3 miss |
| 13 | L3 snoop hit (probe forwarded) |
| 14 | L3 writeback |
| 15 | SLC access |
| 16 | SLC miss |
| 17 | SLC way shutoff active |
| 18 | SLC BDI compression hit |
| 19 | SLC display realtime hold |
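To show how the pulses turn into counters, the sketch below models an aggregator in Python. The event codes and descriptions are copied from the table; the `HpmAggregator` class and its methods are hypothetical and do not describe the CPU's actual counter registers.

```python
from collections import Counter

# Event codes copied from the table above.
HPM_EVENTS = {
    0: "L1I access", 1: "L1I miss", 2: "L1I useful prefetch",
    3: "L1D access", 4: "L1D miss", 5: "L1D useful prefetch",
    6: "L1D ECC single-bit corrected", 7: "L1D ECC double-bit uncorrectable",
    8: "L2 access", 9: "L2 miss", 10: "L2 prefetch",
    11: "L3 access", 12: "L3 miss", 13: "L3 snoop hit (probe forwarded)",
    14: "L3 writeback",
    15: "SLC access", 16: "SLC miss", 17: "SLC way shutoff active",
    18: "SLC BDI compression hit", 19: "SLC display realtime hold",
}


class HpmAggregator:
    """Hypothetical model: each 1-cycle pulse increments the counter for its code."""

    def __init__(self) -> None:
        self.counts = Counter()

    def pulse(self, code: int) -> None:
        if code not in HPM_EVENTS:
            raise ValueError(f"unknown HPM event code {code}")
        self.counts[code] += 1

    def miss_ratio(self, access_code: int, miss_code: int) -> float:
        """Misses divided by accesses; 0.0 when no access has been counted."""
        accesses = self.counts[access_code]
        return self.counts[miss_code] / accesses if accesses else 0.0
```

Pairing an access code with its miss code (for example 3 and 4 for the L1D) gives the per-level miss ratio that the sweep targets report.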
| Target | Coverage |
|---|---|
| make rtl-check | Verilator lint of every cache module |
| make cocotb-cache-coherence | MESI transitions, single-writer-multi-reader |
| make champsim-prefetch-sweep | DPC-3 sweep of upstream ChampSim prefetchers (no/next_line/ip_stride/spp_dev/va_ampm_lite). Berti/IPCP/Bingo/BOP/Pythia remain BLOCKED until ported. evidence_class=champsim_dpc3_traces_only |
| make mockingjay-vs-lru-sweep | DPC-3 LRU baseline + bundled replacement deltas (drrip/ship/srrip). Hawkeye/Mockingjay-prod remain BLOCKED until ported. evidence_class=champsim_dpc3_traces_only |
| make cocotb-cache-mockingjay-accuracy | Mockingjay-prod RTL hit-rate vs LRU on synthetic scan+reuse stream (currently below +10% threshold; see mockingjay_cocotb_synthetic_report.json) |
| make lmbench-cache-curve | lat_mem_rd canonical L1/L2/L3/SLC/DRAM curve (functional in sim; real-target evidence BLOCKED) |
| make cache-hierarchy-claim-gate | Static gate enforcing 2028 minimums and blocked-claim discipline |
This RTL is a synthesizable, Verilator-runnable cache hierarchy for
pre-silicon evaluation, simulation, and ChampSim/gem5 cross-checking. It
is not silicon, it is not measured on a real target, and it is not
evidence of phone-class IPC or latency. Phone-class claims remain
BLOCKED until the gate at docs/evidence/cache/cache-evidence-gate.yaml
records measured evidence from real silicon or a full-system simulator
with traceable provenance.
make cache-hierarchy-claim-gate enforces this boundary. Any addition
of a phone-class claim must replace the corresponding BLOCKED entry
with a measured evidence artifact; the gate fails closed otherwise.