packages/chip/docs/architecture-optimization/sota-2028/cache-hierarchies.md
Sub-report of 2028-sota-integrated-report.md.
| SoC | Big L1I / L1D | Big private L2 | Mid L2 | Little L2 |
|---|---|---|---|---|
| Snapdragon 8 Elite Gen 5 (Oryon Gen 3, 2 Prime + 6 Perf) | 192 KB L1I / 128 KB L1D on Prime; 128 KB on Perf | shared 12 MB L2 / 2 Prime | shared 12 MB / 6 Perf | n/a |
| MediaTek Dimensity 9500 (1 C1-Ultra + 3 C1-Premium + 4 C1-Pro) | per-Arm | 2 MB C1-Ultra | 1 MB / C1-Premium | 512 KB / C1-Pro |
| Apple A19 Pro (2 P + 4 E) | 192/128 KB P; 128/64 KB E | 16 MB shared 2-P | n/a | 6 MB shared 4-E |
| Apple M5 (4 P + 6 E) | 192/128 KB P; 128/64 KB E | per-cluster L2 (multi-MB) | n/a | per-cluster E-L2 |
| Tensor G5 (1 X4 + 5 A725 + 2 A520) | Arm stock | 2 MB X4 | 1 MB A725 | 128-256 KB A520 |
| Exynos 2600 (1 + 3 + 6) | 64/128 KB C1-Ultra; 64/64 KB C1-Pro | 3 MB C1-Ultra | 1 MB C1-Pro | same |
| Arm Cortex-X925 (2025 IP) | 64/64 KB fixed | 2 MB 8-way or 3 MB 12-way | n/a | n/a |

| SoC | Cluster L3 | SLC | Total on-die SRAM |
|---|---|---|---|
| Snapdragon 8 Elite Gen 5 | none separate (per-cluster L2 instead) | 8 MB SLC | ~24 L2 + 8 SLC ≈ 32 MB |
| Dimensity 9500 | 16 MB DSU L3 | 10 MB SLC | 7 L2 + 16 L3 + 10 SLC ≈ 33 MB |
| Apple A19 Pro | none separate | 32 MB SLC | 16 P-L2 + 6 E-L2 + 32 SLC = 54 MB |
| Apple A19 | same | 12 MB SLC | 22 L2 + 12 SLC = 34 MB |
| Apple M5 | per-cluster | ~32 MB SLC | dozens of MB |
| Exynos 2600 | 16 MB L3 | unspecified | ~28 MB |
| Snapdragon X2 Elite (PC ref) | per-cluster | 8 MB SLC at 228 GB/s memory | — |
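The totals in the last column can be re-derived from the per-level figures in the two tables. The sketch below does that arithmetic for the rows where every level is stated; the dictionary layout and the name `total_sram_mb` are illustrative only, and L1 capacity is excluded, as it appears to be in the table.

```python
# Per-level capacities in MB, copied from the two tables above.
# Rows with unspecified levels (M5, Exynos 2600, X2 Elite) are left out.
SRAM_MB = {
    "Snapdragon 8 Elite Gen 5": {"L2": 12 + 12, "SLC": 8},
    "Dimensity 9500": {"L2": 2 + 3 * 1 + 4 * 0.5, "L3": 16, "SLC": 10},
    "Apple A19 Pro": {"P-L2": 16, "E-L2": 6, "SLC": 32},
    "Apple A19": {"L2": 22, "SLC": 12},
}


def total_sram_mb(levels):
    """Sum every listed cache level for one SoC (L1 caches are not counted)."""
    return sum(levels.values())


for soc, levels in SRAM_MB.items():
    print(f"{soc}: {total_sram_mb(levels):g} MB")
```

The Dimensity 9500 L2 figure is the per-core sum: one 2 MB C1-Ultra, three 1 MB C1-Premium, and four 512 KB C1-Pro slices give 7 MB.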
| Prefetcher / replacement policy | Class | Best result | Notes |
|---|---|---|---|
| Stride / next-line | L1D, L1I | trivial baseline | universal floor |
| IPCP (ISCA'20) | L1D | ~6% IPC over baseline on SPEC | CS / CPLX / NL |
| Bingo (HPCA'19) | L1D | strong on irregular | short + long footprints |
| SPP (MICRO'16) | L2 | signature-based, scalable | DPC-3 winner-class |
| Berti (MICRO'22) | L1D | beats IPCP/Bingo on most SPEC | local-delta timing-aware |
| Pythia (MICRO'21) | hybrid | +3.4% over MLOP, +3.8% over Bingo, +4.3% over SPP | online RL agent |
| SPPAM (2026) | hybrid | +31.4% over no-prefetch, +6.2% over Berti+Pythia baseline | latest published |
| FDIP (1999, revisited 2020) | L1I front-end | covers most I-miss when BTB large | foundation for modern Arm/Intel |
| UDP / PDIP / DEER | L1I | improves FDIP for mobile/data-center large-footprint | DEER specifically targets modern mobile |
| Mockingjay (HPCA'22) | LLC replacement | +15.2% over LRU (vs. +12.9% for Hawkeye, +7.6% for SHiP) | Belady-MIN mimicry by expected hit count |
| Hawkeye (ISCA'16) | LLC | +12.9% over LRU | learns from past Belady oracle |
| DRRIP (ISCA'10) | LLC | classical, easy in RTL | set-dueling SRRIP vs BRRIP |

Mobile-vendor public prefetcher detail is thin. Apple uses multiple AMP/spatial/temporal prefetchers with deep lookahead; Arm DSU L3 ships with Best Offset Prefetch and stride; Snapdragon X-class Oryon is publicly described as having aggressive PC-keyed stride+stream prefetching and a big BTB.
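To make the "PC-keyed stride" baseline concrete, here is a minimal behavioural sketch of a PC-indexed stride prefetcher. It is a toy model, not any vendor's design: the class names, the unbounded table, the 2-bit confidence counter, and the `degree` and `threshold` parameters are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # cache line size assumed for the demo access pattern


@dataclass
class StrideEntry:
    last_addr: int
    stride: int = 0
    confidence: int = 0  # saturating counter in the range 0..3


class StridePrefetcher:
    """Toy PC-indexed stride prefetcher (illustrative only)."""

    def __init__(self, degree=2, threshold=2):
        self.table = {}  # PC -> StrideEntry; unbounded to keep the sketch short
        self.degree = degree
        self.threshold = threshold

    def access(self, pc, addr):
        """Train on one demand access and return the addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = StrideEntry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        if stride != 0 and stride == entry.stride:
            entry.confidence = min(entry.confidence + 1, 3)
        else:
            entry.stride = stride
            entry.confidence = 0
        entry.last_addr = addr
        if entry.stride == 0 or entry.confidence < self.threshold:
            return []
        return [addr + entry.stride * i for i in range(1, self.degree + 1)]


pf = StridePrefetcher()
issued = []
for i in range(4):  # one load PC walking memory one line at a time
    issued = pf.access(pc=0x400, addr=i * LINE_BYTES)
print(issued)
```

With the default threshold the prefetcher stays silent for the first three accesses and issues its first two prefetches on the fourth, once the same stride has been confirmed twice.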
| SoC | Fabric | Protocol |
|---|---|---|
| Snapdragon 8 Elite Gen 5 | Qualcomm proprietary NoC | MOESI-class, snoop+directory hybrid |
| Dimensity 9500 / Exynos 2600 / Tensor | Arm DSU-120 / DSU + CI-700 (mobile CMN) | AMBA 5 CHI, MESI/MOESI-style |
| Apple A19 Pro / M5 | Apple proprietary | Directory-based, exclusive SLC on M-series |
| Open RISC-V (XiangShan Kunminghu) | CHI-style L2/L3 (CoupledL2) | MESI-equivalent |
| Chipyard / Rocket / BOOM | TileLink TL-C | MESI-equivalent |

Inspected files under packages/chip: docs/arch/{cpu-subsystem,memory-subsystem,interconnect}.md, rtl/memory/, rtl/interconnect/, rtl/cpu/.

- rtl/memory/e1_axi_lite_dram.sv is a 4 KiB SRAM-backed AXI-Lite slave: single-beat, aligned 32-bit only.
- rtl/interconnect/e1_axi_lite_interconnect.sv is a single-port AXI-Lite mux with a tiny decode map; CPU vs DMA arbitration is fixed CPU-wins. No bursts, IDs, atomics, cacheability attributes, or TileLink / AXI4 semantics.
- rtl/memory/ contains exactly one file. rtl/interconnect/ contains two.

Bottom line: zero implemented cache hierarchy. Memory traffic is one 32-bit beat against a 4 KiB SRAM masquerading as DRAM. A CVA6 wrapper exists but is not wired into a multi-level cache.
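The behaviour described above is small enough to capture in a few lines. The following is a rough behavioural model, not a transcription of the RTL: the function names are hypothetical, offsets are taken relative to the SRAM base, and the real decode map is not modelled.

```python
SRAM_BYTES = 4 * 1024  # 4 KiB backing store, per the description above


def is_legal_access(offset, size_bytes=4):
    """Accept only aligned, single-beat 32-bit accesses inside the SRAM.

    `offset` is relative to the SRAM base (an assumption of this sketch).
    """
    return size_bytes == 4 and offset % 4 == 0 and 0 <= offset < SRAM_BYTES


def arbitrate(cpu_req, dma_req):
    """Fixed-priority arbitration: the CPU wins whenever it requests."""
    if cpu_req:
        return "cpu"
    if dma_req:
        return "dma"
    return None


print(is_legal_access(0x0FFC), is_legal_access(0x1000), arbitrate(True, True))
```

One consequence of a fixed CPU-wins policy is that DMA is granted only in cycles where the CPU is idle, so sustained CPU traffic can starve it.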
| Property | Big OoO | Mid OoO | Little in-order |
|---|---|---|---|
| L1I size / assoc / line | 64 KB / 8-way / 64 B (stretch 96 KB) | 64 KB / 4-way / 64 B | 32 KB / 4-way / 64 B |
| L1D size / assoc / line | 64 KB / 8-way / 64 B (stretch 96 KB) | 64 KB / 4-way / 64 B | 32 KB / 4-way / 64 B |
| L1D bandwidth | 2× 128-bit R + 2× 128-bit W /cycle | 1× 128-bit R/W | 1× 64-bit R/W |
| Load-use latency | 4 cyc | 4 cyc | 3 cyc |
| Inclusion w.r.t. L2 | non-inclusive | non-inclusive | inclusive |
| L1I prefetcher | FDIP with ≥8K BTB, decoupled FTQ | FDIP-lite | next-line only |
| L1D prefetcher | Berti + IP-stride + next-line; optional Pythia | IPCP-lite | stride only |
| Hardware TLB | 64 L1I / 96 L1D, fully assoc; 4 KB/2 MB/1 GB | 48/64 | 32/32 |

64 KB L1 matches the Cortex-X925 sweet spot and XiangShan Kunminghu targets. Apple-class 192 KB L1I costs area, energy, and tag-check latency.
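The size / associativity / line-size triples in the table fix the set count and index width of each L1. A short sketch of that derivation follows; `cache_geometry` is a hypothetical helper, and it assumes a conventional set-indexed cache with a power-of-two number of sets.

```python
def cache_geometry(size_kb, ways, line_bytes=64):
    """Derive set count, index/offset bits, and bytes per way."""
    size_bytes = size_kb * 1024
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder or sets & (sets - 1):
        raise ValueError(
            f"{size_kb} KB / {ways}-way does not give a power-of-two set count"
        )
    return {
        "sets": sets,
        "index_bits": sets.bit_length() - 1,
        "offset_bits": line_bytes.bit_length() - 1,
        "way_bytes": size_bytes // ways,
    }


# L1 configurations taken from the table above: (size in KB, associativity).
for name, (size_kb, ways) in {
    "big": (64, 8),
    "mid": (64, 4),
    "little": (32, 4),
}.items():
    print(name, cache_geometry(size_kb, ways))
```

Under the same assumption, the 96 KB stretch option at 8-way works out to 192 sets, which the helper rejects; 96 KB at 12-way would give 128 sets.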
| Tool | Use |
|---|---|
| ChampSim | Prefetcher (IPCP, Bingo, SPP, Berti, Pythia, SPPAM) and replacement (LRU, DRRIP, Hawkeye, Mockingjay) sweeps on DPC-3 and SPEC CPU 2017 + GAP traces |
| gem5 RISC-V O3 | End-to-end IPC and miss-rate, full system, TLB and page-walk |
| XiangShan emu | Cross-check L1/L2 vs open RISC-V baseline (CoupledL2 + BOP) |
| CoMeT / Sniper | Mobile multi-program contention sweeps for SLC partitioning |

- lat_mem_rd (stride-walking pointer chase): canonical L1/L2/L3/SLC/DRAM latency curve.
- bw_mem: single-thread BW per cache level.
- perf c2c: cache-to-cache transfers, false/true sharing on coherence fabric.
- perf mem: load latency histograms, MPKI by source level.
- simpleperf (Android): same counters under realistic AOSP workloads.

Publish lmbench lat_mem_rd curves at five working-set points (1 KB, 64 KB, 1 MB, 16 MB, 256 MB), at three frequency points (idle, nominal, max), with explicit thermal state. Do not publish raw IPC vs Apple; publish per-MPKI normalized comparisons.
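The publication rule above implies a fixed measurement matrix plus an MPKI figure per run. The sketch below enumerates that matrix and computes MPKI from raw counters; the function names are hypothetical, and how the MPKI values are then used to normalize a comparison is deliberately left open because the text does not specify it.

```python
from itertools import product

WORKING_SETS = ["1 KB", "64 KB", "1 MB", "16 MB", "256 MB"]
FREQ_POINTS = ["idle", "nominal", "max"]


def measurement_matrix():
    """Every (working set, frequency point) pair that needs a latency curve."""
    return list(product(WORKING_SETS, FREQ_POINTS))


def mpki(misses, instructions):
    """Misses per kilo-instruction from raw event counts."""
    return 1000.0 * misses / instructions


print(len(measurement_matrix()), "lat_mem_rd runs per thermal state")
print(mpki(misses=2_500, instructions=1_000_000))
```

Each of the fifteen runs would also carry its recorded thermal state, which is metadata rather than a sweep axis in this sketch.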
- make cache-hierarchy-claim-gate: RTL exists at each level.
- make cocotb-cache-coherence: TL-C / CHI coherence vectors.
- make champsim-prefetch-sweep: DPC-3 sweep of upstream-bundled prefetchers (no/next_line/ip_stride/spp_dev/va_ampm_lite); Berti / IPCP / Bingo / BOP / Pythia remain BLOCKED until the CRC drop-ins are ported. evidence_class=champsim_dpc3_traces_only.
- make mockingjay-vs-lru-sweep: DPC-3 LRU baseline + bundled replacement deltas (lru/drrip/ship/srrip); Hawkeye / Mockingjay-prod remain BLOCKED. evidence_class=champsim_dpc3_traces_only.
- make cocotb-cache-mockingjay-accuracy: Mockingjay-prod RTL vs LRU oracle on synthetic scan+reuse stream.

- docs/project/uma-coherency-validation-strategy.yaml: IO-coherent vs non-coherent + explicit cache maintenance. Recommend IO-coherent for clean dma-buf story.
- docs/spec-db/process-14a-effects.yaml: SS corner pushes L1/L2 cycle time up. Allocate guard cycles.
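A synthetic scan+reuse stream is useful because it is the textbook case where LRU thrashes while a Belady-style policy keeps the reused lines. The sketch below is a software illustration only: it compares LRU against Belady's MIN (the oracle that Mockingjay is described as mimicking), not Mockingjay itself, on a small fully associative cache. The stream shape, cache size, and all function names are assumptions.

```python
from collections import OrderedDict


def scan_reuse_stream(hot=4, scan=16, rounds=8):
    """Hot lines reused every round, interleaved with never-reused scan lines."""
    stream, next_scan = [], 1000
    for _ in range(rounds):
        stream.extend(range(hot))                           # reuse set
        stream.extend(range(next_scan, next_scan + scan))   # one-shot scan
        next_scan += scan
    return stream


def lru_hits(stream, capacity):
    cache, hits = OrderedDict(), 0
    for line in stream:
        if line in cache:
            hits += 1
            cache.move_to_end(line)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits


def belady_hits(stream, capacity):
    # next_use[i] is the index of the next access to stream[i], or infinity.
    next_use, last_seen = [0] * len(stream), {}
    for i in range(len(stream) - 1, -1, -1):
        next_use[i] = last_seen.get(stream[i], float("inf"))
        last_seen[stream[i]] = i
    cache, hits = {}, 0  # line -> index of its next use
    for i, line in enumerate(stream):
        if line in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # furthest next use goes first
            del cache[victim]
        cache[line] = next_use[i]
    return hits


stream = scan_reuse_stream()
print("LRU hits:", lru_hits(stream, capacity=8))
print("MIN hits:", belady_hits(stream, capacity=8))
```

With these parameters each 16-line scan flushes the 8-entry LRU cache before the hot lines return, so LRU never hits, while MIN retains the four hot lines and hits on them in every round after the first.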