packages/chip/docs/architecture-optimization/sota-2028/memory-subsystem.md
Sub-report of 2028-sota-integrated-report.md.
| Standard | Per-pin rate | Per-pin BW | Per 16-bit channel | Notes |
|---|---|---|---|---|
| LPDDR5 | 6.4 Gbps | 0.8 GB/s | 12.8 GB/s | JESD209-5 |
| LPDDR5X (5C) | up to 10.7 Gbps | 1.34 GB/s | 21.4 GB/s | JESD209-5C Jun 2023; link-ECC |
| LPDDR5T (SK hynix) | 9.6 Gbps | 1.2 GB/s | 19.2 GB/s | Vendor brand |
| LPDDR6 (JESD209-6) | 10.667 - 14.4 Gbps | 1.8 GB/s @ 14.4 | ~21.6 GB/s per 12-bit half-channel | Jul 9 2025; reduced IO voltage; link-ECC + on-die ECC baseline |
| LPDDR6 stretch | 14.4 - 17 Gbps post-1.0 | up to 2.1 GB/s | — | Trendforce/Cadence roadmap |
LPDDR5X retains LPDDR5's "two 16-bit sub-channels per x32 die"; LPDDR6 switches to a 24-bit channel split into two 12-bit sub-channels, integrates link-ECC + on-die ECC as baseline. Samsung 10.7 Gbps LPDDR6 on 12 nm class announced Nov 2025 with ~21% energy efficiency uplift vs LPDDR5X.
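
The per-pin and per-channel columns in the table follow from the pin rate alone. A minimal sketch of that arithmetic, assuming decimal units (1 GB/s = 8 Gbps); the function names and the row list are illustrative, not taken from any repo file:

```python
def per_pin_gbytes(rate_gbps: float) -> float:
    """Per-pin bandwidth in GB/s for a pin running at rate_gbps."""
    return rate_gbps / 8.0


def channel_gbytes(rate_gbps: float, width_bits: int) -> float:
    """Bandwidth in GB/s of a channel (or sub-channel) width_bits wide."""
    return rate_gbps * width_bits / 8.0


# (name, pin rate in Gbps, channel or sub-channel width in bits)
ROWS = [
    ("LPDDR5", 6.4, 16),
    ("LPDDR5X", 10.7, 16),
    ("LPDDR6", 14.4, 12),  # 12-bit sub-channel of the 24-bit channel
]

for name, rate, width in ROWS:
    print(
        f"{name}: {per_pin_gbytes(rate):.2f} GB/s per pin, "
        f"{channel_gbytes(rate, width):.1f} GB/s per {width}-bit unit"
    )
```

The LPDDR6 row is computed over a 12-bit sub-channel, which is why its per-unit figure lands close to LPDDR5X's 16-bit figure despite the higher pin rate.
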
| SoC | DRAM | Bus | Peak BW | Capacity | SLC | IOMMU |
|---|---|---|---|---|---|---|
| Snapdragon 8 Elite Gen 5 | LPDDR5X-5300 | 4×16b = 64b | 84.8 GB/s | up to 24 GB | ~8 MiB SLC | Arm SMMU-700 |
| Snapdragon 8 Elite | LPDDR5X-9600 | 4×16b | ~76.8 GB/s | up to 24 GB | 8 MiB SLC | SMMU-700 |
| MediaTek Dimensity 9500 | LPDDR5X-10667 | 4×16b | ~85.3 GB/s | up to 16 GB | 10 MiB SLC + 16 MiB L3 | MediaTek MMU/SMMU |
| Apple A19 Pro | LPDDR5X-9600 | 4×16b | 75.8 - 76.8 GB/s | 12 GB | ~24 MiB SLC (third-party) | Apple UMA |
| Google Tensor G5 | LPDDR5X | 4×16b | ~68 GB/s class | 12-16 GB | undisclosed | Arm SMMU + Google IP |
| Xiaomi XRing O1 | LPDDR5T | 4×16b | ~85 GB/s class | flagship | undisclosed | unclear |
Every 2025 flagship Android-class AP runs a 64-bit total physical bus = 4 channels × 16-bit. The bandwidth uplift of Snapdragon 8 Elite Gen 5 over Gen 4, and of Dimensity 9500 (~85 GB/s), is driven entirely by data rate (9600 → 10600/10667 MT/s) at fixed 64-bit width. The "5300" in LPDDR5X-5300 is the clock in MHz, i.e. 10600 MT/s, which is what 84.8 GB/s over 64 bits implies. Capacity uplift to 24 GB is from die density (32 Gb per die). SLCs sit between 8 and ~24 MiB.
| Layer | Open / Closed | Notes |
|---|---|---|
| LPDDR5X PHY (10.67 Gbps) | Closed: Synopsys DWC_LPDDR5X54X, Cadence LPDDR5X/4X, Rambus | Synopsys rated 8533-10667 Mbps; LPDDR6/5X PHY at 14.4 Gbps |
| LPDDR6 PHY (14.4 Gbps) | Closed: Cadence tape-out July 2025 (industry-first), Synopsys at 14.4 Gbps | Cadence: PHY with DFE/FFE/CTLE, DFI 5.0 controller |
| LPDDR controller open IP | LiteDRAM (LPDDR4 PHY by Antmicro 2020) | No production LPDDR5/5X/6 open PHY. CHIPS Alliance + Google Rowhammer test framework piggy-backs on LiteDRAM |
| RISC-V IOMMU | Ratified v1.0.1, 2024-09-11 | Per-device DC, PASID, page-request, fault queue, DTF bit; QEMU emulation merged 2024; Linux RISC-V IOMMU driver in -next |
| Arm SMMUv3.x | SMMUv3.4 in Armv9 | Mature Linux iommu/arm-smmu-v3 driver; SVA + I/O page-fault upstream since v5.3-5.5 |
| Coherent fabric | Arm CMN-S3 / CMN-700 (AMBA 5 CHI), TileLink-C, AXI4 + ACE | CMN-S3 is Arm's current Neoverse / mobile-server mesh; native AMBA-5 CHI |
| Display compression | AFBC (Arm), AFRC (random-access), ASTC | AFBC lossless, 50% BW reduction between GPU/VPU/DPU |
packages/chip, repo-grounded:

- rtl/memory/e1_axi_lite_dram.sv: 1024 × 32-bit SRAM, single-beat AXI-Lite, one outstanding write + read, OKAY/SLVERR only, no bursts.
- rtl/interconnect/e1_axi_lite_interconnect.sv and e1_linux_soc_contract.sv: AXI-Lite 3-master (CPU / DMA / debug), fixed CPU-priority arbiter, 4 outstanding per master, 1024-cycle watchdog, decode-err sticky reg, single 256 MiB aperture at 0x8000_0000 but only 4 KiB implemented. No bursts. No IDs. No cache attributes. No coherency. No atomics. No QoS regs.
- docs/arch/memory-subsystem.md + docs/evidence/memory/uma-dram-evidence-gate.yaml: explicit fail-closed. Phase0 (4 KiB SRAM containment) current; phase1 (counters), phase2 (burst fabric), phase3 (UMA), phase4 (IOMMU), phase5 (LPDDR target) blocked.
- 0x8000_0000 / 256 MiB; Verilator behavioural model, not controller/PHY.
- docs/arch/interconnect.md notes "not AXI4, not TileLink, not CHI, not ACE".
- docs/spec-db/process-14a-effects.yaml calls out 14a_sram_macro_vmin_ecc_evidence_missing.
- docs/architecture-optimization/soc-optimized-operating-point.yaml: presumes 240 GB/s sustained DRAM BW. That is higher than the gate's 120 GB/s; the optimizer runs off an aspirational number that the rest of the contract has not signed off.

Bottom line: entire memory stack below the AXI-Lite scaffold is fictional from a silicon standpoint. The repo is honest about this. No PHY, no controller, no cache, no SLC, no coherency fabric, no IOMMU, no QoS arbiter, no ECC, no measurement target.
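
The gap between the decoded aperture and the backing store can be made concrete with a small address classifier. This is a sketch only: it assumes the 4 KiB SRAM sits at the base of the 256 MiB aperture, and the category names are hypothetical rather than taken from the RTL.

```python
APERTURE_BASE = 0x8000_0000
APERTURE_SIZE = 256 * 1024 * 1024      # 256 MiB decoded aperture
SRAM_WORDS = 1024                      # 1024 x 32-bit words
WORD_BYTES = 4
SRAM_SIZE = SRAM_WORDS * WORD_BYTES    # 4 KiB actually implemented


def classify(addr: int) -> str:
    """Classify a byte address against the aperture and the backing SRAM.

    Assumption: the SRAM occupies the lowest 4 KiB of the aperture.
    """
    if not (APERTURE_BASE <= addr < APERTURE_BASE + APERTURE_SIZE):
        return "outside-aperture"
    if addr - APERTURE_BASE < SRAM_SIZE:
        return "sram"
    return "aperture-unbacked"


# Fraction of the aperture that has storage behind it.
BACKED_FRACTION = SRAM_SIZE / APERTURE_SIZE

print(classify(0x8000_0FFC), classify(0x8000_1000), classify(0x7FFF_FFFC))
print(f"backed fraction: 1/{APERTURE_SIZE // SRAM_SIZE}")
```

Under these assumptions only 1 part in 65536 of the aperture is backed, which is the containment the phase0 gate describes.
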
| Parameter | Minimum (must-ship) | Stretch (AI SKU) |
|---|---|---|
| Standard | LPDDR5X-10667 (JESD209-5C) | LPDDR6-14400 (JESD209-6) |
| Bus width at PHY | 4 ch × 16-bit = 64-bit (8 sub-ch × 8-bit byte lanes) | 4 ch × 24-bit = 96-bit logical (8 sub-ch × 12-bit LPDDR6) |
| Peak bandwidth | 85.3 GB/s | 172.8 GB/s |
| Sustained target | ≥70 GB/s (~82% peak with display+camera+NPU contention) | ≥140 GB/s sustained |
| Capacity SKUs | 12 GiB (entry), 16 GiB (mid) | 24 GiB (AI) using 32 Gb dies ×6 |
| ECC | Mandatory on-die (LPDDR5X+) + link-ECC enabled | Plus optional inline parity for TEE/security regions |
| Refresh | Per-bank refresh; fine-grained tRFCab/tRFCpb knobs | Plus temperature-compensated refresh (TCSR) |
| Training | Full read/write leveling, gate training, vref, periodic ZQ cal | Plus per-byte-lane DFE/FFE training (LPDDR6) |
To hit the gate's 120 GB/s sustained target, the stretch SKU is mandatory: LPDDR5X-10667 at 64-bit caps at 85.3 GB/s peak. Even the stretch SKU's 172.8 GB/s peak sits slightly below the gate's 180 GB/s peak figure, so the gate needs revisiting either way. Either downgrade the gate to ~80 GB/s sustained on the LPDDR5X SKU (120-180 GB/s reserved for the LPDDR6 SKU), or widen the bus to 128-bit (M-series / AI-PC territory, breaks the phone power budget). Recommendation: split SKUs, with baseline LPDDR5X at 70 GB/s sustained and the AI SKU on LPDDR6 at 140 GB/s sustained.
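
The gate comparison is plain arithmetic and can be checked mechanically. A minimal sketch, assuming the ~82% sustained-to-peak ratio from the table applies to every configuration; the configuration names and the `report` helper are illustrative:

```python
GATE_SUSTAINED_GBS = 120.0   # uma-dram-evidence-gate.yaml sustained figure
GATE_PEAK_GBS = 180.0        # uma-dram-evidence-gate.yaml peak figure
SUSTAINED_FRACTION = 0.82    # assumption: ~82% of peak, as in the table

# name -> (data rate in MT/s, total bus width in bits)
CONFIGS = {
    "LPDDR5X-10667 x64": (10667, 64),
    "LPDDR6-14400 x96": (14400, 96),
    "LPDDR5X-10667 x128": (10667, 128),
}


def peak_gbs(mtps: int, bus_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s."""
    return mtps * bus_bits / 8 / 1000


def report() -> dict:
    """Return name -> (peak, sustained, meets_sustained_gate, meets_peak_gate)."""
    rows = {}
    for name, (mtps, bits) in CONFIGS.items():
        peak = peak_gbs(mtps, bits)
        sustained = peak * SUSTAINED_FRACTION
        rows[name] = (
            peak,
            sustained,
            sustained >= GATE_SUSTAINED_GBS,
            peak >= GATE_PEAK_GBS,
        )
    return rows


for name, (peak, sustained, ok_sus, ok_peak) in report().items():
    print(f"{name}: peak {peak:.1f}, sustained {sustained:.1f}, "
          f"sustained gate {ok_sus}, peak gate {ok_peak}")
```

Under these assumptions the 64-bit LPDDR5X configuration misses the sustained gate, the 96-bit LPDDR6 and 128-bit LPDDR5X configurations clear it, and none of the three reaches 180 GB/s peak.
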
Hardest open RISC-V question. No open LPDDR5X/6 PHY today. LiteDRAM tops out at LPDDR4.
Non-negotiable IP buy. The repo's docs/spec-db/mobile-sota-2026.yaml calls "custom LPDDR5X/LPDDR6 PHY" an explicit non-goal — promote to procurement gate.
| Block | Recommendation | Why |
|---|---|---|
| CPU↔LLC↔SLC fabric | AMBA-5 CHI (Arm CMN-S3 class) or open TileLink-C | CHI production standard; TileLink-C open path (SiFive/BOOM/Rocket). CHI faster; TileLink-C consistent with open story |
| NPU/GPU/ISP fabric | AXI4 with ACE-Lite (IO-coherent) into SLC | Avoids full snoop-in for read-many-write-rarely accelerator traffic |
| Display + camera VC | Dedicated VC / QoS class on NoC, latency-sensitive priority | Display underflow hard real-time |
| SLC size | 24 MiB (must-ship) / 32 MiB (AI SKU) | Matches A19 Pro / above D9500 (10 MiB) and S8E (8 MiB). At 14A/N2 SRAM density (~38 Mb/mm² N2 GAA) → 32 MiB ~7-10 mm² |
| SLC partitioning | Per-master way-allocation + pseudo-LRU + stash hints | NPU and camera benefit from explicit stash; CPU benefits from way-partition isolation |
| NoC topology | 2D-mesh CMN-S3-class, 4-6 home nodes, 2 memory home nodes | Matches LPDDR memory-controller count |
| Coherency directives | I/O-coherent DMA + NPU read paths; non-coherent + cache-maintenance for video/display writes | Hybrid is what Snapdragon/Dimensity actually do |
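
The SLC area estimate follows from capacity divided by density. A rough sketch, assuming "Mb" means 2^20 bits and that the quoted density is achievable across the whole array; the `array_efficiency` parameter is a hypothetical knob for tag, ECC and periphery overhead, not a number from the source:

```python
def sram_area_mm2(size_mib: float,
                  density_mbit_per_mm2: float,
                  array_efficiency: float = 1.0) -> float:
    """Estimate SRAM area in mm^2.

    size_mib: capacity in MiB (1 MiB = 8 Mbit with Mbit = 2**20 bits).
    density_mbit_per_mm2: density in Mbit/mm^2.
    array_efficiency: fraction of area that is data array (hypothetical).
    """
    mbit = size_mib * 8
    return mbit / density_mbit_per_mm2 / array_efficiency


for size in (24, 32):
    raw = sram_area_mm2(size, 38.0)
    derated = sram_area_mm2(size, 38.0, array_efficiency=0.7)
    print(f"{size} MiB: {raw:.1f} mm^2 raw, {derated:.1f} mm^2 at 70% efficiency")
```

At the quoted ~38 Mb/mm², 32 MiB of raw array comes out just under 7 mm², and any overhead pushes it higher.
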
| Decision | Recommendation |
|---|---|
| Spec | RISC-V IOMMU v1.0.1 ratified (Sep 2024) for RISC-V-native path; SMMUv3.4-equivalent feature set required |
| Page-table format | Sv39 (3-level) + Sv48 (4-level), compatible with RISC-V MMU; G-stage for virtualization |
| Streams | Per-device DC with PASID; IDs for NPU command-queue contexts, display planes, camera ISP pipelines, GPU contexts, DMA channels |
| Fault reporting | Fault queue with master/stream ID, IOVA, fault type, syndrome, PASID, page-request interface for SVA |
| Linux integration | RISC-V IOMMU driver in -next; Android requires dma-buf/iommu-v2 mapping ABI |
| Risk | Linux RISC-V IOMMU + QEMU still maturing (v6.x kernels). Plan for upstream churn through 2026-2027 |
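
For the page-table format row, the way an IOVA splits into page-table indices under Sv39 and Sv48 can be sketched as below. This models only the index split (9 bits per level, 12-bit page offset). Canonical-address checks, superpages and the widened G-stage root are deliberately left out, and `split_iova` is an illustrative name.

```python
PAGE_SHIFT = 12          # 4 KiB pages
VPN_BITS = 9             # 512 entries per table level
LEVELS = {"Sv39": 3, "Sv48": 4}


def split_iova(iova: int, mode: str):
    """Return ([vpn indices, root level first], page offset) for an IOVA.

    Simplification: addresses with bits above the mode's width are rejected
    instead of applying the architectural canonical-form rules.
    """
    levels = LEVELS[mode]
    va_bits = PAGE_SHIFT + VPN_BITS * levels
    if iova < 0 or iova >> va_bits:
        raise ValueError(f"IOVA does not fit in {va_bits} bits for {mode}")
    offset = iova & ((1 << PAGE_SHIFT) - 1)
    vpn = [
        (iova >> (PAGE_SHIFT + VPN_BITS * level)) & ((1 << VPN_BITS) - 1)
        for level in range(levels)
    ]
    return vpn[::-1], offset


print(split_iova(0x8000_1234, "Sv39"))
print(split_iova(0x8000_1234, "Sv48"))
```

The same IOVA yields one extra root-level index under Sv48, which is the extra table walk step the fault-injection tests would need to cover.
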
14a_sram_macro_vmin_ecc_evidence_missing blocker.

| Metric | Tool | Pass threshold | Notes |
|---|---|---|---|
| Peak read BW | STREAM (Copy/Scale/Add/Triad) | ≥85% theoretical peak | -O3 -fopenmp; pin threads |
| Latency to DRAM | lmbench lat_mem_rd | ≤120 ns p95 random-read | Stride > LLC; defeat prefetch with random walk |
| Pointer-chase | lmbench random | curve L1 → L2 → L3 → SLC → DRAM | Plot working-set vs latency; verify each level |
| Sustained BW | bw_mem rd/wr/rdwr/cp/bzero | ≥120 GB/s stretch / ≥70 baseline | Multi-thread, per-channel NUMA-pinned |
| Mixed access | mlc (Intel) port or open equivalent | latency curve under BW load | Build using lmbench bw_mem + lat_mem_rd concurrently |
| Contended IO | fio random + sequential vs UFS while STREAM | UFS BW degrade ≤15% under DRAM saturation | UFS 4.x and DRAM share controller-side QoS |
| MLPerf Mobile | TFLite/ExecuTorch — MobileBERT, MobileNet, DeepLabv3, SSD, SD-XL (v6.0 added LLM/diffusion) | end-to-end latency + samples/s + thermal | MLPerf Inference v6.0 ran April 2026; single-stream + offline |
| Contended quad | NPU command queue + AFBC display 120 Hz QHD + camera ISP sim + dhrystone | display underflow 0; NPU TOPS drop ≤10%; CPU p99 bounded | Killer test; display underflow gate already named |
| Stale-buffer negative | dma-buf producer forgets cache-clean → consumer detects | must fault or be statically forbidden | Required by uma-coherency-validation-strategy |
| IOMMU fault | program unauthorized IOVA from NPU/DMA → expect fault queue entry | fault entry has master, IOVA, access, syndrome | Required by RISC-V IOMMU spec |
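
The "defeat prefetch with random walk" note in the latency row is commonly implemented as a pointer chase over a single-cycle random permutation. A minimal sketch using Sattolo's algorithm to build such a permutation; this is one common technique and not necessarily what lmbench does internally, and the Python walk only illustrates the access pattern, since real timing would be done in a compiled harness:

```python
import random


def sattolo_cycle(n: int, seed: int = 0) -> list:
    """Return a permutation of range(n) that forms one cycle of length n."""
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def chase(perm: list, steps: int, start: int = 0) -> int:
    """Follow next-pointers for `steps` hops; each hop depends on the last."""
    idx = start
    for _ in range(steps):
        idx = perm[idx]
    return idx


def cycle_length(perm: list, start: int = 0) -> int:
    """Number of hops needed to return to `start`."""
    idx, hops = perm[start], 1
    while idx != start:
        idx = perm[idx]
        hops += 1
    return hops


perm = sattolo_cycle(1 << 12)
print(cycle_length(perm) == len(perm))
```

Because the permutation is a single cycle, a chase of `len(perm)` hops touches every slot exactly once, so the working-set size on the latency curve equals the array size.
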
| Competitor | Best public number | Source |
|---|---|---|
| Snapdragon 8 Elite Gen 5 | 84.8 GB/s peak | Notebookcheck; Qualcomm product brief |
| Snapdragon 8 Elite | ~76.8 GB/s | Notebookcheck; chipsandcheese X2 Elite |
| Apple A19 Pro | 75.8-76.8 GB/s; latency ~115 ns | Notebookcheck; AppleWiki; chipsandcheese A17/A18 |
| Dimensity 9500 | ~85.3 GB/s peak; 10 MiB SLC + 16 MiB L3 | MediaTek; innoGyan |
| Tensor G5 | ~68 GB/s class | Android Central |
Use chipsandcheese latency curves and Anandtech BW plots as the public comparator.
0x8000_0000 / 256 MiB from Chipyard.

| Category | Optimization | Why |
|---|---|---|
| PHY | Synopsys/Cadence LPDDR6/5X PHY at 14.4/10.67 Gbps with DFE/FFE/CTLE | Cannot self-design at this rate |
| Controller | Per-channel reorder queue, write-combining, refresh scheduler with PBR, page-policy heuristics, ZQ cal, on-die ECC + link-ECC | Memory controller table-stakes |
| Bus | AXI4 with bursts, IDs, exclusive monitors, ACE-Lite + CHI bridge | Required for SLC attach |
| SLC | 24-32 MiB, way-partitioned, stash-on-write hints from NPU/camera | Hides LPDDR latency from NPU bursts |
| NoC | CMN-S3-class or TileLink-C mesh with 2 memory home nodes per channel | Avoids single arbiter bottleneck |
| QoS | 6-class scheduler: display(RT) > camera > CPU > NPU > GPU > DMA-bulk; per-master BW meters; latency targets | Display underflow zero at 120/144 Hz QHD |
| IOMMU | RISC-V IOMMU v1.0.1 with G-stage, PASID, page-request, fault queue, ATS | Required by Android dma-buf + secure HAL |
| AFBC | AFBC 1.x or 2.0 on display + GPU + VPU | -50% display BW, free 30 GB/s headroom |
| NPU activation compression | Lossless tile-based on activations between L2 SRAM and DRAM | Mirrors MediaTek/Apple |
| Refresh | Per-bank refresh + temperature-compensated | -8% to -20% latency overhead recovery |
| Counters | Per-master read/write/error/latency-histogram via Linux EDAC + perf | Required by gate |
| ECC | On-die + link-ECC always-on, EDAC events to user-space, optional inline ECC for TEE | LPDDR5+ assumes this |
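
The priority order in the QoS row can be sketched as a strict-priority pick over pending requests. This is a deliberately simplified model: it ignores the per-master BW meters and latency targets the row also calls for, and strict priority on its own would starve low classes, so treat it as an illustration of the ordering only. The master names are illustrative.

```python
from typing import Dict, List, Optional

# Highest priority first, following the order in the QoS row.
PRIORITY = ["display", "camera", "cpu", "npu", "gpu", "dma_bulk"]


def pick(pending: Dict[str, int]) -> Optional[str]:
    """Return the highest-priority master with a pending request, if any."""
    for master in PRIORITY:
        if pending.get(master, 0) > 0:
            return master
    return None


def drain(pending: Dict[str, int], slots: int) -> List[str]:
    """Grant up to `slots` requests in strict priority order."""
    queue = dict(pending)             # do not mutate the caller's dict
    order = []
    for _ in range(slots):
        master = pick(queue)
        if master is None:
            break
        queue[master] -= 1
        order.append(master)
    return order


print(drain({"gpu": 1, "display": 2, "cpu": 1}, slots=3))
```

In the example call the GPU request is never granted within three slots, which is the starvation behaviour the BW meters are there to bound.
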
- CACHE_PRELOAD hints from NPU compiler.
- mobile-sota-2026.yaml from non-goal to procurement requirement.
- uma-dram-evidence-gate.yaml 120 GB/s sustained / 180 GB/s peak; soc-optimized-operating-point.yaml 240 GB/s sustained. LPDDR5X-10667 × 64-bit caps at 85.3 GB/s peak. Split SKUs.
- mobile-sota-2026.yaml.
- soc-optimized-operating-point.yaml (240) with uma-dram-evidence-gate.yaml (120/180). Split SKUs.
- memory_roadmap_phases: burst-capable scaffold with AXI4 IDs + outstanding counters before coherency jump.
- riscv-non-isa/riscv-iommu) under verify/external/.
- process-14a-effects.yaml. Capture SRAM-wall with N3/N5/N2 bitcell numbers.
- compiler/runtime/. Mark all results simulator-only.