Memory Subsystem SOTA — 2028 RISC-V Phone-Class AP

packages/chip/docs/architecture-optimization/sota-2028/memory-subsystem.md

Sub-report of 2028-sota-integrated-report.md.

A. SOTA snapshot

A.1 Memory technology

| Standard | Per-pin rate | Per-pin BW | Per-channel BW (16-bit unless noted) | Notes |
|---|---|---|---|---|
| LPDDR5 | 6.4 Gbps | 0.8 GB/s | 12.8 GB/s | JESD209-5 |
| LPDDR5X (5C) | up to 10.7 Gbps | 1.34 GB/s | 21.4 GB/s | JESD209-5C Jun 2023; link-ECC |
| LPDDR5T (SK hynix) | 9.6 Gbps | 1.2 GB/s | 19.2 GB/s | Vendor brand |
| LPDDR6 (JESD209-6) | 10.667 - 14.4 Gbps | 1.8 GB/s @ 14.4 Gbps | ~21.6 GB/s per 12-bit sub-channel | Jul 9 2025; reduced IO voltage; link-ECC + on-die ECC baseline |
| LPDDR6 stretch | 14.4 - 17 Gbps post-1.0 | up to 2.1 GB/s | not stated | Trendforce/Cadence roadmap |

LPDDR5X retains LPDDR5's "two 16-bit sub-channels per x32 die" organisation. LPDDR6 switches to a 24-bit channel split into two 12-bit sub-channels and makes link-ECC plus on-die ECC baseline features. Samsung announced a 10.7 Gbps LPDDR6 part on a 12 nm-class process in Nov 2025, with ~21% better energy efficiency than LPDDR5X.
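
The per-pin and per-channel columns in the table follow directly from the per-pin data rate. A minimal sketch of that arithmetic is below; the function names are illustrative and not taken from the repo, and the results are raw signalling rates that ignore any protocol or ECC overhead.

```python
def per_pin_gbytes(rate_gbps: float) -> float:
    """Per-pin bandwidth in GB/s for a per-pin data rate in Gbit/s."""
    return rate_gbps / 8.0


def channel_gbytes(rate_gbps: float, width_bits: int) -> float:
    """Raw bandwidth in GB/s of a channel that is `width_bits` pins wide."""
    return rate_gbps * width_bits / 8.0


# (label, per-pin rate in Gbps, channel or sub-channel width in bits)
rows = [
    ("LPDDR5, 16-bit channel", 6.4, 16),
    ("LPDDR5X, 16-bit channel", 10.7, 16),
    ("LPDDR6, 12-bit sub-channel", 14.4, 12),
]

for label, rate, width in rows:
    print(f"{label}: {per_pin_gbytes(rate):.2f} GB/s per pin, "
          f"{channel_gbytes(rate, width):.1f} GB/s per channel")
```

Multiplying the per-channel figure by the channel count gives the package-level peak used in the SoC comparison.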

A.2 Named competitor SoCs

| SoC | DRAM | Bus | Peak BW | Capacity | SLC | IOMMU |
|---|---|---|---|---|---|---|
| Snapdragon 8 Elite Gen 5 | LPDDR5X, 5300 MHz clock (10600 MT/s) | 4×16b = 64b | 84.8 GB/s | up to 24 GB | ~8 MiB SLC | Arm SMMU-700 |
| Snapdragon 8 Elite | LPDDR5X-9600 | 4×16b | ~76.8 GB/s | up to 24 GB | 8 MiB SLC | SMMU-700 |
| MediaTek Dimensity 9500 | LPDDR5X-10667 | 4×16b | ~85.3 GB/s | up to 16 GB | 10 MiB SLC + 16 MiB L3 | MediaTek MMU/SMMU |
| Apple A19 Pro | LPDDR5X-9600 | 4×16b | 75.8 - 76.8 GB/s | 12 GB | ~24 MiB SLC (third-party) | Apple UMA |
| Google Tensor G5 | LPDDR5X | 4×16b | ~68 GB/s class | 12-16 GB | undisclosed | Arm SMMU + Google IP |
| Xiaomi XRing O1 | LPDDR5T-9600 | 4×16b | ~76.8 GB/s class | flagship | undisclosed | unclear |

Every 2025 flagship AP listed above runs a 64-bit total physical bus (4 channels × 16-bit). The bandwidth uplift of Snapdragon 8 Elite Gen 5 and Dimensity 9500 (~85 GB/s) over the previous generation is driven entirely by data rate (9600 → 10600/10667 MT/s) at a fixed 64-bit width; 84.8 GB/s over 64 bits implies 10600 MT/s, i.e. a 5300 MHz clock. The capacity uplift to 24 GB comes from die density (32 Gb per die). SLCs sit between 8 and ~24 MiB.

A.3 IP layer

| Layer | Open / Closed | Notes |
|---|---|---|
| LPDDR5X PHY (10.67 Gbps) | Closed: Synopsys DWC_LPDDR5X54X, Cadence LPDDR5X/4X, Rambus | Synopsys rated 8533-10667 Mbps; LPDDR6/5X PHY at 14.4 Gbps |
| LPDDR6 PHY (14.4 Gbps) | Closed: Cadence tape-out July 2025 (industry-first), Synopsys at 14.4 Gbps | Cadence: PHY with DFE/FFE/CTLE, DFI 5.0 controller |
| LPDDR controller open IP | LiteDRAM (LPDDR4 PHY by Antmicro 2020) | No production LPDDR5/5X/6 open PHY. CHIPS Alliance + Google Rowhammer test framework piggy-backs on LiteDRAM |
| RISC-V IOMMU | Ratified v1.0.1, 2024-09-11 | Per-device DC, PASID, page-request, fault queue, DTF bit; QEMU emulation merged 2024; Linux RISC-V IOMMU driver in -next |
| Arm SMMUv3.x | SMMUv3.4 in Armv9 | Mature Linux iommu/arm-smmu-v3 driver; SVA + I/O page-fault upstream since v5.3-5.5 |
| Coherent fabric | Arm CMN-S3 / CMN-700 (AMBA 5 CHI), TileLink-C, AXI4 + ACE | CMN-S3 is Arm's current Neoverse / mobile-server mesh; native AMBA-5 CHI |
| Display compression | AFBC (Arm), AFRC (random-access), ASTC | AFBC lossless, 50% BW reduction between GPU/VPU/DPU |

B. Current state in packages/chip

Repo-grounded:

  • rtl/memory/e1_axi_lite_dram.sv: 1024 × 32-bit SRAM, single-beat AXI-Lite, one outstanding write + read, OKAY/SLVERR only, no bursts.
  • rtl/interconnect/e1_axi_lite_interconnect.sv and e1_linux_soc_contract.sv: AXI-Lite 3-master (CPU / DMA / debug), fixed CPU-priority arbiter, 4 outstanding per master, 1024-cycle watchdog, decode-err sticky reg, single 256 MiB aperture at 0x8000_0000 but only 4 KiB implemented. No bursts. No IDs. No cache attributes. No coherency. No atomics. No QoS regs.
  • docs/arch/memory-subsystem.md + docs/evidence/memory/uma-dram-evidence-gate.yaml: explicit fail-closed. Phase0 (4 KiB SRAM containment) current; phase1 (counters), phase2 (burst fabric), phase3 (UMA), phase4 (IOMMU), phase5 (LPDDR target) blocked.
  • Chipyard generated AP gives SimDRAM at 0x8000_0000 / 256 MiB; Verilator behavioural model, not controller/PHY.
  • docs/arch/interconnect.md notes "not AXI4, not TileLink, not CHI, not ACE".
  • docs/spec-db/process-14a-effects.yaml calls out 14a_sram_macro_vmin_ecc_evidence_missing.
  • docs/architecture-optimization/soc-optimized-operating-point.yaml: presumes 240 GB/s sustained DRAM BW, twice the gate's 120 GB/s sustained. The optimizer runs off an aspirational number that the rest of the contract has not signed off.

Bottom line: the entire memory stack below the AXI-Lite scaffold is fictional from a silicon standpoint, and the repo is honest about this. There is no PHY, no controller, no cache, no SLC, no coherency fabric, no IOMMU, no QoS arbiter, no ECC, and no measurement target.
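
To make the phase0 containment concrete, here is a small behavioural sketch of the address decode described above: a 256 MiB aperture at 0x8000_0000 with only the first 4 KiB backed by SRAM. It is not derived from the RTL. The response for an in-aperture but unimplemented address (SLVERR here) and the DECERR label for out-of-aperture accesses are assumptions, as are all names in the code.

```python
from dataclasses import dataclass

APERTURE_BASE = 0x8000_0000
APERTURE_SIZE = 256 * 1024 * 1024   # 256 MiB decoded window
IMPLEMENTED_SIZE = 1024 * 4         # 1024 x 32-bit words = 4 KiB of real SRAM


@dataclass
class DecodeModel:
    """Toy model of the decode path; the sticky flag mimics the decode-err register."""
    decode_err_sticky: bool = False

    def access(self, addr: int) -> str:
        offset = addr - APERTURE_BASE
        if not (0 <= offset < APERTURE_SIZE):
            # Outside every decoded window: record it and report a decode error.
            self.decode_err_sticky = True
            return "DECERR"
        if offset >= IMPLEMENTED_SIZE:
            # Assumption: inside the aperture but beyond the SRAM fails closed.
            return "SLVERR"
        return "OKAY"


model = DecodeModel()
for addr in (0x8000_0000, 0x8000_0FFC, 0x8000_1000, 0x9000_0000):
    print(f"{addr:#010x} -> {model.access(addr)}")
print("sticky decode error:", model.decode_err_sticky)
```

A model like this is only useful as a reference for negative tests; the RTL remains the source of truth for the actual response codes.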

C.1 External memory

| Parameter | Minimum (must-ship) | Stretch (AI SKU) |
|---|---|---|
| Standard | LPDDR5X-10667 (JESD209-5C) | LPDDR6-14400 (JESD209-6) |
| Bus width at PHY | 4 ch × 16-bit = 64-bit (8 byte lanes) | 4 ch × 24-bit = 96-bit logical (8 sub-ch × 12-bit LPDDR6) |
| Peak bandwidth | 85.3 GB/s | 172.8 GB/s |
| Sustained target | ≥70 GB/s (~82% peak with display+camera+NPU contention) | ≥140 GB/s sustained |
| Capacity SKUs | 12 GiB (entry), 16 GiB (mid) | 24 GiB (AI); needs 6 × 32 Gb or 8 × 24 Gb dies (4 × 32 Gb gives only 16 GiB) |
| ECC | Mandatory on-die (LPDDR5X+) + link-ECC enabled | Plus optional inline parity for TEE/security regions |
| Refresh | Per-bank refresh; fine-grained tRFCab/tRFCpb knobs | Plus temperature-compensated refresh (TCSR) |
| Training | Full read/write leveling, gate training, vref, periodic ZQ cal | Plus per-byte-lane DFE/FFE training (LPDDR6) |

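To hit the gate's 120 GB/s sustained / 180 GB/s peak, the stretch SKU is mandatory: LPDDR5X-10667 at 64-bit caps at 85.3 GB/s peak. Even LPDDR6-14400 at 96-bit peaks at 172.8 GB/s, so the 180 GB/s peak figure additionally needs a post-1.0 data rate (15 Gbps or higher) or a relaxed gate. The options are to downgrade the gate to ~80 GB/s sustained on the LPDDR5X SKU (with 120-180 GB/s reserved for the LPDDR6 SKU), or to widen the bus to 128-bit (M-series / AI-PC territory, which breaks the phone power budget). Recommendation: split SKUs, with a baseline LPDDR5X SKU at 70 GB/s sustained and an AI SKU on LPDDR6 at 140 GB/s sustained.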
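
The SKU comparison above is simple arithmetic and can be checked with a short script. This is a sketch under stated assumptions: the 0.82 sustained-to-peak ratio is lifted from the "~82% peak" note in the table and applied uniformly to every configuration, and the helper names are illustrative.

```python
GATE_SUSTAINED_GBS = 120.0   # from uma-dram-evidence-gate.yaml, as quoted above
GATE_PEAK_GBS = 180.0
SUSTAINED_FRACTION = 0.82    # assumption: same efficiency for every SKU


def peak_gbytes(mt_per_s: int, bus_bits: int) -> float:
    """Peak bandwidth in GB/s for a data rate in MT/s and a bus width in bits."""
    return mt_per_s * bus_bits / 8 / 1000


def meets_gate(mt_per_s: int, bus_bits: int) -> tuple:
    """Return (sustained_ok, peak_ok) against the gate figures."""
    peak = peak_gbytes(mt_per_s, bus_bits)
    return (peak * SUSTAINED_FRACTION >= GATE_SUSTAINED_GBS, peak >= GATE_PEAK_GBS)


skus = {
    "LPDDR5X-10667, 64-bit": (10667, 64),
    "LPDDR6-14400, 96-bit": (14400, 96),
    "LPDDR5X-10667, 128-bit": (10667, 128),
}

for name, (rate, bits) in skus.items():
    peak = peak_gbytes(rate, bits)
    sustained_ok, peak_ok = meets_gate(rate, bits)
    print(f"{name}: peak {peak:.1f} GB/s, est. sustained {peak * SUSTAINED_FRACTION:.1f} GB/s, "
          f"sustained gate {'ok' if sustained_ok else 'miss'}, peak gate {'ok' if peak_ok else 'miss'}")
```

Changing `SUSTAINED_FRACTION` shows how sensitive the sustained verdict is to controller efficiency, which is the number the evidence gate ultimately has to measure.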

C.2 PHY / controller IP path

Hardest open RISC-V question. No open LPDDR5X/6 PHY today. LiteDRAM tops out at LPDDR4.

  1. License Synopsys DWC LPDDR6/5X PHY + secure controller (DFI 5.0, up to 14.4 Gbps).
  2. License Cadence LPDDR6/5X PHY + controller (tape-out July 2025, industry-first 14.4 Gbps).
  3. License Rambus LPDDR5X PHY at 10.67 Gbps tier; LPDDR6 TBA.
  4. Foundry-supplied PHY bundled with 14A/N3/N2 PDK kit.
  5. Co-development with CHIPS Alliance LiteDRAM + LPDDR5 PHY — research only, not production.

This is a non-negotiable IP buy. The repo's docs/spec-db/mobile-sota-2026.yaml lists "custom LPDDR5X/LPDDR6 PHY" as an explicit non-goal; promote PHY licensing to a procurement gate.

C.3 SoC fabric and SLC

| Block | Recommendation | Why |
|---|---|---|
| CPU↔LLC↔SLC fabric | AMBA-5 CHI (Arm CMN-S3 class) or open TileLink-C | CHI production standard; TileLink-C open path (SiFive/BOOM/Rocket). CHI faster; TileLink-C consistent with open story |
| NPU/GPU/ISP fabric | AXI4 with ACE-Lite (IO-coherent) into SLC | Avoids full snoop-in for read-many-write-rarely accelerator traffic |
| Display + camera VC | Dedicated VC / QoS class on NoC, latency-sensitive priority | Display underflow hard real-time |
| SLC size | 24 MiB (must-ship) / 32 MiB (AI SKU) | Matches A19 Pro / above D9500 (10 MiB) and S8E (8 MiB). At 14A/N2 SRAM density (~38 Mb/mm² N2 GAA), 32 MiB is ~7 mm² of raw macro area (268 Mb ÷ 38 Mb/mm²), more with tags and ECC |
| SLC partitioning | Per-master way-allocation + pseudo-LRU + stash hints | NPU and camera benefit from explicit stash; CPU benefits from way-partition isolation |
| NoC topology | 2D-mesh CMN-S3-class, 4-6 home nodes, 2 memory home nodes | Matches LPDDR memory-controller count |
| Coherency directives | I/O-coherent DMA + NPU read paths; non-coherent + cache-maintenance for video/display writes | Hybrid is what Snapdragon/Dimensity actually do |
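
The SLC area estimate can be reproduced from the quoted macro density. A minimal sketch follows; it assumes "Mb" means 10^6 bits and counts data bits only, so tag arrays, ECC bits and routing are covered only by an optional `overhead` factor whose value is a placeholder, not a measured number.

```python
def sram_macro_area_mm2(capacity_mib: float,
                        density_mbit_per_mm2: float,
                        overhead: float = 1.0) -> float:
    """Estimate SRAM area in mm^2 for a capacity in MiB at a given macro density.

    overhead > 1.0 can stand in for tags, ECC and periphery (assumed, not measured).
    """
    bits = capacity_mib * 1024 * 1024 * 8
    megabits = bits / 1e6
    return megabits / density_mbit_per_mm2 * overhead


for size_mib in (24, 32):
    raw = sram_macro_area_mm2(size_mib, 38.0)
    padded = sram_macro_area_mm2(size_mib, 38.0, overhead=1.3)  # 1.3 is illustrative
    print(f"{size_mib} MiB: {raw:.1f} mm^2 raw, {padded:.1f} mm^2 with 30% overhead")
```

The same function can be rerun with a 14A density once bitcell numbers are captured under process-14a-effects.yaml.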

C.4 IOMMU / SMMU

| Decision | Recommendation |
|---|---|
| Spec | RISC-V IOMMU v1.0.1 ratified (Sep 2024) for RISC-V-native path; SMMUv3.4-equivalent feature set required |
| Page-table format | Sv39 + Sv48 (4-level) compatible with RISC-V MMU; G-stage for virtualization |
| Streams | Per-device DC with PASID; IDs for NPU command-queue contexts, display planes, camera ISP pipelines, GPU contexts, DMA channels |
| Fault reporting | Fault queue with master/stream ID, IOVA, fault type, syndrome, PASID, page-request interface for SVA |
| Linux integration | RISC-V IOMMU driver in -next; Android requires dma-buf/iommu-v2 mapping ABI |
| Risk | Linux RISC-V IOMMU + QEMU still maturing (v6.x kernels). Plan upstream churn through 2026-2027 |
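
The stream and fault-reporting rows can be illustrated with a toy translation model: each (device, PASID) pair owns a page map, and any access outside it lands in a fault queue carrying the fields listed in the table. This is a conceptual sketch only. The class names, record layout and cause strings are hypothetical and do not follow the RISC-V IOMMU specification's actual data structures.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

PAGE = 4096


@dataclass(frozen=True)
class FaultRecord:
    device_id: int
    pasid: Optional[int]
    iova: int
    access: str   # "read" or "write"
    cause: str    # free-form here; a real IOMMU uses encoded cause values


@dataclass
class ToyIommu:
    # (device_id, pasid) -> {iova_page: (phys_page, writable)}
    contexts: dict = field(default_factory=dict)
    fault_queue: deque = field(default_factory=deque)

    def map(self, device_id, pasid, iova, phys, writable=False):
        pages = self.contexts.setdefault((device_id, pasid), {})
        pages[iova // PAGE] = (phys // PAGE, writable)

    def _fault(self, device_id, pasid, iova, access, cause):
        self.fault_queue.append(FaultRecord(device_id, pasid, iova, access, cause))
        return None

    def translate(self, device_id, pasid, iova, access):
        """Return the physical address, or None after queueing a fault record."""
        pages = self.contexts.get((device_id, pasid))
        if pages is None:
            return self._fault(device_id, pasid, iova, access, "no context")
        entry = pages.get(iova // PAGE)
        if entry is None:
            return self._fault(device_id, pasid, iova, access, "page not mapped")
        phys_page, writable = entry
        if access == "write" and not writable:
            return self._fault(device_id, pasid, iova, access, "write to read-only page")
        return phys_page * PAGE + iova % PAGE


NPU = 7  # hypothetical device ID
iommu = ToyIommu()
iommu.map(NPU, pasid=1, iova=0x1000, phys=0x8000_2000)
print(hex(iommu.translate(NPU, 1, 0x1010, "read")))
print(iommu.translate(NPU, 1, 0x5000, "read"), list(iommu.fault_queue))
```

A model at this level is enough to write the negative test first (program an unauthorized IOVA, expect one queue entry) before any RTL or the reference model is wired in.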

C.5 ECC, refresh, training, reliability

  • On-die ECC mandatory (LPDDR5X+ enforces).
  • Link-ECC enabled in controller; counters via Linux EDAC.
  • Per-bank refresh with PBR scheduler prioritizing idle banks.
  • Temperature-compensated refresh from SoC thermal sensors.
  • Patrol scrub for TEE, keyslots.
  • MBIST + repair fuses for on-die SRAM, consistent with 14a_sram_macro_vmin_ecc_evidence_missing blocker.
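
The per-bank refresh item can be sketched as a small selection policy: refresh idle banks first and only interrupt a busy bank once its postponement budget is exhausted. The cycle counts, the `max_postpone` parameter and the data model are assumptions for illustration; a real controller would derive them from the JEDEC timing parameters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bank:
    index: int
    busy: bool          # has queued or in-flight accesses
    last_refresh: int   # cycle of the most recent refresh


def pick_refresh_bank(banks: List[Bank], now: int, interval: int,
                      max_postpone: int) -> Optional[Bank]:
    """Pick the bank to refresh this cycle, or None to postpone.

    A bank is due once `interval` cycles have passed since its last refresh.
    Idle due banks are preferred; a busy bank is forced only after it has
    been postponed by `max_postpone` cycles.
    """
    due = [b for b in banks if now - b.last_refresh >= interval]
    forced = [b for b in due if now - b.last_refresh >= interval + max_postpone]
    if forced:
        return min(forced, key=lambda b: b.last_refresh)
    idle_due = [b for b in due if not b.busy]
    if idle_due:
        return min(idle_due, key=lambda b: b.last_refresh)
    return None


banks = [Bank(0, busy=True, last_refresh=0),
         Bank(1, busy=False, last_refresh=10),
         Bank(2, busy=False, last_refresh=5)]
choice = pick_refresh_bank(banks, now=100, interval=90, max_postpone=50)
print("refresh bank:", choice.index if choice else None)
```

Temperature compensation would enter this sketch by scaling `interval` from the thermal sensor reading.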

D. Benchmarks, evaluation, testing

D.1 Mandatory measurement matrix

| Metric | Tool | Pass threshold | Notes |
|---|---|---|---|
| Peak read BW | STREAM (Copy/Scale/Add/Triad) | ≥85% theoretical peak | -O3 -fopenmp; pin threads |
| Latency to DRAM | lmbench lat_mem_rd | ≤120 ns p95 random-read | Stride > LLC; defeat prefetch with random walk |
| Pointer-chase | lmbench random | curve L1 → L2 → L3 → SLC → DRAM | Plot working-set vs latency; verify each level |
| Sustained BW | bw_mem rd/wr/rdwr/cp/bzero | ≥120 GB/s stretch / ≥70 baseline | Multi-thread, per-channel NUMA-pinned |
| Mixed access | mlc (Intel) port or open equivalent | latency curve under BW load | Build using lmbench bw_mem + lat_mem_rd concurrently |
| Contended IO | fio random + sequential vs UFS while STREAM | UFS BW degrade ≤15% under DRAM saturation | UFS 4.x and DRAM share controller-side QoS |
| MLPerf Mobile | TFLite/ExecuTorch: MobileBERT, MobileNet, DeepLabv3, SSD, SD-XL (v6.0 added LLM/diffusion) | end-to-end latency + samples/s + thermal | MLPerf Inference v6.0 ran April 2026; single-stream + offline |
| Contended quad | NPU command queue + AFBC display 120 Hz QHD + camera ISP sim + dhrystone | display underflow 0; NPU TOPS drop ≤10%; CPU p99 bounded | Killer test; display underflow gate already named |
| Stale-buffer negative | dma-buf producer forgets cache-clean → consumer detects | must fault or be statically forbidden | Required by uma-coherency-validation-strategy |
| IOMMU fault | program unauthorized IOVA from NPU/DMA → expect fault queue entry | fault entry has master, IOVA, access, syndrome | Required by RISC-V IOMMU spec |
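
The "defeat prefetch with random walk" note relies on a pointer chain that visits every slot exactly once in an unpredictable order. One common way to build such a chain is Sattolo's algorithm, which yields a single-cycle permutation. The sketch below shows the construction only; it is not how lmbench builds its chain, and Python timings would mostly measure interpreter overhead, so the actual latency numbers should come from lat_mem_rd or a compiled harness.

```python
import random


def build_chase(n: int, seed: int = 0) -> list:
    """Return next-index links forming one cycle over all n slots (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt: list, steps: int, start: int = 0) -> int:
    """Follow the chain for `steps` dependent loads and return the final index."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


links = build_chase(1 << 12)
visited = set()
idx = 0
for _ in range(len(links)):
    visited.add(idx)
    idx = links[idx]
print("slots visited:", len(visited), "of", len(links), "| back at start:", idx == 0)
```

Sweeping the slot count (times the slot size in a real harness) from below L1 to well above the SLC produces the working-set versus latency curve the matrix asks for.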

D.2 Comparison data sources

| Competitor | Best public number | Source |
|---|---|---|
| Snapdragon 8 Elite Gen 5 | 84.8 GB/s peak | Notebookcheck; Qualcomm product brief |
| Snapdragon 8 Elite | ~76.8 GB/s | Notebookcheck; chipsandcheese X2 Elite |
| Apple A19 Pro | 75.8-76.8 GB/s; latency ~115 ns | Notebookcheck; AppleWiki; chipsandcheese A17/A18 |
| Dimensity 9500 | ~85.3 GB/s peak; 10 MiB SLC + 16 MiB L3 | MediaTek; innoGyan |
| Tensor G5 | ~68 GB/s class | Android Central |

Use chipsandcheese latency curves and Anandtech BW plots as the public comparator.

E. Optimizations

Already present

  • Address decode containment for DMA.
  • CPU-priority arbitration with negative test for DMA over MMIO.
  • AXI-Lite watchdog and decode-err sticky reg.
  • Verilator SimDRAM at 0x8000_0000 / 256 MiB from Chipyard.

Required before 2028 phone-class claim

| Category | Optimization | Why |
|---|---|---|
| PHY | Synopsys/Cadence LPDDR6/5X PHY at 14.4/10.67 Gbps with DFE/FFE/CTLE | Cannot self-design at this rate |
| Controller | Per-channel reorder queue, write-combining, refresh scheduler with PBR, page-policy heuristics, ZQ cal, on-die ECC + link-ECC | Memory controller table-stakes |
| Bus | AXI4 with bursts, IDs, exclusive monitors, ACE-Lite + CHI bridge | Required for SLC attach |
| SLC | 24-32 MiB, way-partitioned, stash-on-write hints from NPU/camera | Hides LPDDR latency from NPU bursts |
| NoC | CMN-S3-class or TileLink-C mesh with 2 memory home nodes per channel | Avoids single arbiter bottleneck |
| QoS | 6-class scheduler: display(RT) > camera > CPU > NPU > GPU > DMA-bulk; per-master BW meters; latency targets | Display underflow zero at 120/144 Hz QHD |
| IOMMU | RISC-V IOMMU v1.0.1 with G-stage, PASID, page-request, fault queue, ATS | Required by Android dma-buf + secure HAL |
| AFBC | AFBC 1.x or 2.0 on display + GPU + VPU | -50% display BW, free 30 GB/s headroom |
| NPU activation compression | Lossless tile-based on activations between L2 SRAM and DRAM | Mirrors MediaTek/Apple |
| Refresh | Per-bank refresh + temperature-compensated | -8% to -20% latency overhead recovery |
| Counters | Per-master read/write/error/latency-histogram via Linux EDAC + perf | Required by gate |
| ECC | On-die + link-ECC always-on, EDAC events to user-space, optional inline ECC for TEE | LPDDR5+ assumes this |
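
The QoS row combines a strict priority order with per-master bandwidth meters. A minimal sketch of how the two interact is below, assuming a simple per-window byte budget and a work-conserving fallback when every requester is over budget. The class, the budget values and the window mechanism are illustrative and do not describe an existing arbiter in the repo.

```python
PRIORITY = ["display", "camera", "cpu", "npu", "gpu", "dma_bulk"]  # highest first


class QosArbiter:
    def __init__(self, budgets: dict):
        # budgets: bytes each master may move per accounting window (absent = unlimited)
        self.budgets = dict(budgets)
        self.used = {m: 0 for m in PRIORITY}

    def new_window(self) -> None:
        """Reset the per-master bandwidth meters at the start of a window."""
        self.used = {m: 0 for m in PRIORITY}

    def grant(self, requests: dict):
        """requests maps master -> bytes requested; returns the granted master or None."""
        # Pass 1: highest-priority requester that is still within its budget.
        for m in PRIORITY:
            size = requests.get(m)
            if size is not None and self.used[m] + size <= self.budgets.get(m, float("inf")):
                self.used[m] += size
                return m
        # Pass 2: everyone is over budget, so stay work-conserving and follow priority.
        for m in PRIORITY:
            if m in requests:
                self.used[m] += requests[m]
                return m
        return None


arb = QosArbiter({"cpu": 64, "npu": 128})
print(arb.grant({"cpu": 64, "dma_bulk": 64}))   # cpu is within budget
print(arb.grant({"cpu": 64, "dma_bulk": 64}))   # cpu is now over budget
print(arb.grant({"display": 64, "cpu": 64}))    # display always wins
```

Leaving the display class unbudgeted reflects the hard real-time requirement; whether camera should also be exempt is a policy decision this sketch does not settle.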

Optional but high-value

  • SLC compression (Apple-style cache-line compression).
  • Stash-on-write with explicit CACHE_PRELOAD hints from NPU compiler.
  • Memory-side bloom filter for dirty tracking.
  • Companion 64 MiB on-die SRAM tier for NPU activations.

F. Risks and open questions

  1. No open LPDDR5X/6 PHY. Must license Synopsys/Cadence/Rambus or take foundry-bundled PHY. Promote in mobile-sota-2026.yaml from non-goal to procurement requirement.
  2. PHY cost and area: 64-bit LPDDR5X PHY at 10.67 Gbps on N3/N2 ~5-7 mm²; license mid-7-figures + royalty. LPDDR6 14.4 Gbps more area for DFE/FFE.
  3. RISC-V IOMMU maturity: spec ratified Sep 2024, Linux driver merged ~v6.10-6.12, QEMU base 2024. No shipping Android phone with RISC-V IOMMU + tested HAL. Plan multi-quarter contributor work on Android IOMMU bindings, dma-buf v2, gralloc, NN HAL.
  4. Gate-vs-target inconsistency: uma-dram-evidence-gate.yaml 120 GB/s sustained / 180 GB/s peak; soc-optimized-operating-point.yaml 240 GB/s sustained. LPDDR5X-10667 × 64-bit caps at 85.3 GB/s peak. Split SKUs.
  5. SRAM wall on 14A: N3 gained only ~5% SRAM density over N5; N2 recovers ~17% via nanosheets (38 Mb/mm² macro). Plan the SLC twice: N2-class (24-32 MiB) and 14A-class (assume similar).

Recommended actions

  1. Promote "custom LPDDR5X/LPDDR6 PHY" to procurement decision in mobile-sota-2026.yaml.
  2. Reconcile soc-optimized-operating-point.yaml (240) with uma-dram-evidence-gate.yaml (120/180). Split SKUs.
  3. Add phase1.5 to memory_roadmap_phases: burst-capable scaffold with AXI4 IDs + outstanding counters before coherency jump.
  4. Pull RISC-V IOMMU reference model (riscv-non-isa/riscv-iommu) under verify/external/.
  5. Define LPDDR PHY attach contract via DFI 5.0 boundary signals as controller's "south" interface (Synopsys + Cadence both speak DFI 5.0).
  6. Define dma-buf v2 + RISC-V IOMMU + AFBC Android stack as separate work order with its own evidence gate.
  7. Plan SLC sizing twice (N2 and 14A) under process-14a-effects.yaml. Capture SRAM-wall with N3/N5/N2 bitcell numbers.
  8. Build LPDDR-aware bandwidth/latency simulator (wrap DRAMSim3 or Ramulator2) under compiler/runtime/. Mark all results simulator-only.
