packages/chip/docs/architecture-optimization/2028-sota-integrated-report.md
Open RISC-V Android phone SoC (e1 → 2028 product). Cross-domain synthesis of eight specialist research reports under sota-2028/. Dated 2026-05-19.
This is a planning and claim-boundary artifact, not an implementation claim. Every numeric target here is gated by the existing fail-closed evidence pipeline (docs/architecture-optimization/cpu-npu-2028-readiness-scorecard.yaml, docs/evidence/, pd/signoff/).
The 2028 phone-class envelope is concrete and well-bounded by the 2025-2026 flagship cohort: Apple A19 Pro, Snapdragon 8 Elite Gen 5, Dimensity 9500, Exynos 2600, Tensor G5. Across every domain the same shape repeats:
| Domain | Snapdragon 8 Elite Gen 5 | Dimensity 9500 | Apple A19 Pro | Exynos 2600 | Tensor G5 |
|---|---|---|---|---|---|
| Process | TSMC N3P | TSMC N3P | TSMC N3P | Samsung SF2 (2 nm GAA) | TSMC N3 |
| Big core | Oryon Gen 3 Prime 4.74 GHz | C1-Ultra 4.21 GHz | A19 Pro P 4.26 GHz | C1-Ultra 3.8 GHz | Cortex-X4 3.78 GHz |
| Topology | 2+6 | 1+3+4 | 2+4 | 1+3+6 (reported) | 1+5+2 |
| Decode width (big) | 9-wide | 10-wide | ~9-wide native | 10-wide | 10-wide |
| ROB (big) | ~650 | ~2000 inflight | ~700 | similar to C1-Ultra | ~525 effective |
| L1I/L1D (big) | 192/128 KB | per Arm ref | ~192/128 KB | 64/128 KB | 64/64 KB |
| Private L2 (big) | 12 MB shared 2-Prime | 2 MB / core | 16 MB 2-P cluster | 3 MB | 2 MB |
| L3 cluster | — | 16 MB DSU L3 | — | 16 MB | DSU L3 |
| SLC | 8 MB | 10 MB | 32 MB | unspec | ~8 MB |
| DRAM | LPDDR5X-5300, 84.8 GB/s, ≤24 GB | LPDDR5X-10667, ~85 GB/s, ≤16 GB | LPDDR5X-9600, 75.8 GB/s, 12 GB | LPDDR5X-class | LPDDR5X, 12-16 GB |
| Coherent fabric | Qualcomm NoC (closed) | Arm DSU-120 + CI-700 (CHI) | Apple proprietary | Arm DSU + CMN | Arm DSU |
| IOMMU | Arm SMMU-700 | Arm SMMU | Apple IOMMU | Arm SMMU | Arm SMMU |
| BPU L1 BTB / RAS | ~2K + 2048-entry indirect / 48 | DSU+Arm BPU | ~44K total TAGE entries | similar to C1-Ultra | Arm BPU |
| Geekbench 6 ST/MT | 3649 / 11068 | 3502 / 10417 | 3834-3895 / 9988 | 3197 / 11065 | 2288 / 6030 |
| PMIC rails | 30-40 LDO + ~14 SMPS across 6-8 dies | MT6373+MT6363 (~11 SMPS + 20 LDO) | custom (~12-18 rails internal) | S2MPS27 | reused Samsung PMIC |
| BSPDN | no | no | no | no | no |
No 2025-2026 flagship ships BSPDN. Apple, Qualcomm, MediaTek, Samsung, Google all use frontside power delivery on TSMC N3P / Samsung SF2. BSPDN first appears at Intel 18A (Dec 2025 HVM) and TSMC A16 (2027 HVM). TSMC's mobile A14 in 2028 is frontside-only; the BSPDN A14 variant ("A12") slips to 2029. For a 2028 phone product, frontside PDN is the safe call.
Consolidated hard numbers. Where two appear (MVP / stretch), MVP is the baseline SKU and stretch is the AI-heavy SKU.
| Spec | Ultra big | Premium mid | Pro little |
|---|---|---|---|
| ISA | RVA23 + V (VLEN=256) + Zfh/Zvfh/Zvbb/Zvkt + Zicfilp/Zicfiss + Ztso (PTE-bit) + Sv57 | RVA23+V (VLEN=128) | RVA23 |
| Width | 8-wide decode (+2 fused) | 6-wide decode | 1-2-wide in-order |
| ROB / inflight | 512 | 256 | n/a |
| PRF (INT/FP-V) | 400 / 400 (256-bit V regs) | 192 / 192 | small |
| Branch predictor | TAGE-SC-L + ITTAGE + uFTB 512 + FTB 8K + L2 BTB 16K + RAS 32/64 + SC + loop, 2 taken/cycle, ≤14 cyc penalty | XiangShan-class TAGE-SC | bimodal+BTB+RAS |
| L1I / L1D | 64 KB 8-way / 64 KB 8-way (stretch 96 KB) | 64 / 64 KB 4-way | 32 / 32 KB 4-way |
| Private L2 | 2 MB 8-way | 1 MB 8-way | 256-512 KB 4-way |
| TLB | 64 L1 + 2048 L2 unified, Sv48/57, 4K/2M/1G pages | smaller | smaller |
| Memory ordering | RVWMO + Ztso per-page | RVWMO | RVWMO |
| Clock | 4.0-4.3 GHz typical, 4.5 burst | 3.2-3.4 GHz | 2.4-2.6 GHz |
| IPC SPEC2017int | ≥ 9 | ~5 | ~1.6 |
| Area (N2P class) | ~2.0 mm² incl L2 | ~0.85 mm² | ~0.25 mm² |
| Power | 1.4 W sustained / 2.8 W burst | 0.8 W / 1.6 W | 0.3 W / 0.7 W |
| Recommended open base | fork Tenstorrent Ascalon-D8 (Apache, 8-wide, LLVM-upstream); fallback XiangShan Kunminghu V3 scaled to 8-wide | XiangShan Kunminghu V3 (Mulan PSL v2) | CVA6 (Apache/Solderpad) |
Management hart: Ibex (lowRISC, Apache-2.0) on AON rail for boot, security, PMU, RPMI firmware.
soc-optimized-operating-point.yaml presumes 240 GB/s; uma-dram-evidence-gate.yaml requires 120 GB/s; LPDDR5X 64-bit caps at 85 GB/s. Split SKUs.| Item | 2028 commitment |
|---|---|
| Process primary | TSMC N2P (HVM 2H 2026, mobile-mature by 2028, frontside PDN) |
| Process stretch | TSMC A14 baseline (HVM 2028, frontside) or Intel 14A (BSPDN + High-NA, strategic 2nd source) |
| Density target | ~313 MTr/mm² HD logic (N2 class); SRAM macro 38.1 Mb/mm² |
| Die area budget | 100-130 mm² (flagship-mobile band) |
| Wafer / NRE | $30-45k/wafer; $250-400M total tapeout NRE |
| Power envelope | 5 W peak / 3.5 W sustained @ 95 °C Tj |
| Rail count | 16 rails (3 CPU clusters + NPU + GPU + SOC fabric + SRAM + 4 LPDDR rails + analog + AON + PMC + 2 IO + USB PHY + RF ref) |
| Regulator strategy | External buck per primary rail + per-core dLDO on big-CPU and NPU for fast DVFS; AON retention LDO |
| Droop response | <20 ns dLDO + 1-cycle clock stretcher + AVFS with PVT-compensated canary FFs |
| Power-mgmt firmware | RISC-V SBI MPxy + RPMI v1.0 (ratified) — drop-in equivalent to Arm SCMI; Linux drivers landed 2025 in 6.x |
| Mgmt core | Ibex RV32IMC on AON rail |
| Decap | SHPMIM-class on-die MIM (~150 nF/mm² in hot blocks), ~5× Cdec/Iavg total |
| PDN signoff | Voltus or RedHawk-SC dynamic IR + EM, 16-32 corners, multi-Vt; OpenROAD psm only for triage |
| Macro placement | AlphaChip RL (open) or DREAMPlace 4.0 (open, GPU, 30× faster) for soft+hard macros; commercial Cerebrus AI Studio / DSO.ai if partner available |
| Clock tree | CCOpt or ClockMesh+ConcurrentClockData (mesh + leaf H-tree hybrid is N3/N2 standard); TritonCTS is triage |
| STA | MMMC POCV/SOCV with LVF, 100-200 effective corners, ML pruning to 32-64 |
| Open-PDK demonstrator track | IHP SG13G2 130 nm + Sky130 (Basilisk-class, ~77 MHz ceiling — proves open methodology, not perf) |
| Layer | 2028 baseline |
|---|---|
| Toolchain | LLVM trunk pinned; -mcpu=eliza-e1 -march=rva23u64 -O3 -flto=thin -fprofile-sample-use=… -fbasic-block-sections=labels then Propeller relink then BOLT |
| Vector | RVV 1.0 VLEN=256 in CPU; autovec + hand-tuned intrinsics for top-20 kernels |
| Matrix | Live in NPU only; do not depend on AME/IME/VME ratification (AME data-type vote recalled Dec 2025) |
| NPU compiler | MLIR/IREE with custom elizanpu dialect emitting descriptors → command queue → tile DMA |
| PyTorch path | ExecuTorch backend → IREE |
| TFLite/LiteRT | NNAPI/AIDL HAL wired to descriptor submission; INT2/INT4/FP8 ops |
| Quantization | PTQ INT8 baseline; AWQ INT4 weight-only for LLMs; FP8 E4M3 on long-context; 2:4 sparsity wired (chip already has SDOT4_S4_2_4) |
| Android | RVA23 mandatory; ART RISC-V backend optimization (vendor work); baseline profiles + profile-guided dexopt; ThinLTO in system image |
| CFI | Zicfilp + Zicfiss landing-pad + shadow-stack on by default |
| AutoFDO+Propeller+BOLT measured uplift | 12-18% on full system image (10% AutoFDO+Propeller, 2-6% BOLT, ~2% MFS), net 5-10% after Spectre mitigations |
| Kernel | Linux 6.19+ with RISC-V Spectre mitigations (5-10% cost); SBI MPxy + RPMI for power; mainline RISC-V IOMMU driver |
Most-cited blockers, deduplicated across the 8 sub-reports:
| Wall | Domain | Impact |
|---|---|---|
| No open LPDDR5X/6 PHY exists | Memory, PD | Must license Synopsys / Cadence / Rambus. Breaks "open RISC-V" narrative at analog tier. License mid-7-figures + per-chip royalty. PHY occupies 5-7 mm². |
| Foundry access at N2P / A14 / 14A | Process, PD | Apple holds >50% of TSMC N2 through 2027 Q2. Open project has no leverage at TSMC. Realistic path: Intel Foundry Services (DARPA/RAMP-C subsidy) or hyperscaler/government anchor. |
| Tapeout NRE | Process | $250-400M single tapeout at N2P; $300-500M at A14. Open-source funding models do not reach this scale; needs commercial customer or government program. |
| Commercial signoff EDA is closed | PD, Power | Voltus, RedHawk-SC, PrimeTime, Tempus, Quantus, IC Validator, Pegasus. OpenROAD has zero certified PDKs sub-7 nm. Plan for $5M+/year EDA license budget. |
| No open mobile-class PMIC IP | Power | v0 must use 6-8 catalog buck/LDO ICs on daughtercard, controlled via I²C/SPMI by mgmt core. v1 internalizes. |
| RISC-V IOMMU driver maturity | Memory | Spec ratified Sep 2024, Linux driver landed v6.10-6.12. Android dma-buf v2 + gralloc + NN HAL bindings need contributor work through 2026-2027. |
| AME/IME/VME matrix not ratified for 2028 | OoO, Compiler | Plan around RVV-1.0 + NPU matrix. Do not assume standard matrix ISA in 2028 ship. |
| AOSP RISC-V Tier-1 still volatile | Compiler | Google removed RISC-V from AOSP common kernel April 2024, reinstated Tier-1 late 2025. Pin our own AOSP branch with checked-in manifest SHA. |
| OpenROAD sub-7 nm unproven | PD | Largest open-flow Linux-capable SoC: Basilisk at 130 nm, 77 MHz. Stage-3 N2P closure assumes commercial tools. |
| AlphaChip proxy ≠ PPA win | PD | "False Dawn" + arXiv 2302.11014 reanalysis. e1 256-macro smoke: 3.01% proxy win, but verification on routed PPA mandatory. |
| RVV autovec lags SVE2 in LLVM | Compiler | Igalia data shows 9% geomean uplift in 18 months; predication and stride loads still mis-costed. Closing fast; will be near closed by 2027. Hand-tune top-20 kernels. |
| Closed-vendor NPU stacks (QNN, NeuroPilot, Core ML) are years ahead | Compiler, NPU | IREE+ExecuTorch is the best open answer but is 1.5-3× slower than vendor SDKs on like-for-like models in 2026. |
| BSPDN thermal penalty + DFT complications | Process, PD | If 2029 A14P / Intel 14A variant is built. Active layer buried in BEOL; Tj rises 5-10 °C; probe access changes; multi-quarter learning curve. |
| SRAM scaling stalled at N3 | Process, Cache | TSMC N3 only ~5% SRAM density improvement vs N5. N2 resumes via macro-level density (38.1 Mb/mm²), not bitcell shrink. Size SLC twice — once for N2, once for 14A. |
| Repo gate inconsistency: 240 vs 120 GB/s | Memory, Power | soc-optimized-operating-point.yaml presumes 240 GB/s sustained; gate requires 120 GB/s. LPDDR5X 64-bit caps at 85 GB/s. Split SKUs to reconcile. |
Unified eval gauntlet across all domains. Mandatory artifacts before any 2028 claim.
| Layer | Metric | Tool | 2028 target | Comparator |
|---|---|---|---|---|
| CPU big | SPEC CPU2017 intrate / core | SPEC + LLVM | ≥9 | X925 ~11.8, A19 ~12 |
| CPU big | Geekbench 6 ST | GB6 | ≥2800 | A19 Pro 3895, Oryon Gen 3 3649, C1-Ultra 3502 |
| CPU big | Geekbench 6 MT | GB6 | ≥8500 | S8E5 11068, D9500 10417, A19 Pro 9988 |
| CPU big | CoreMark / MHz | CoreMark | ≥10 | X925 ~13, Zen 5 ~12.5 |
| CPU | JetStream2 | V8 / Hermes | ≥250 | Android/Chrome reality check |
| Front-end | SPECint avg MPKI | gem5-XiangShan + traces | ≤4.0 | TAGE-SC-L 64KB ≈ 3.986 |
| Front-end | Android cold-launch MPKI | simpleperf + gem5 | ≤8 | dominated by I-footprint, not direction misprediction |
| Front-end | Taken-branch throughput | microbenchmark | ≥1.5/cycle sustained | 2 taken/cycle = Zen 5/X925/Lion Cove parity |
| Cache | lmbench lat_mem_rd curve | lmbench | publish full curve | chipsandcheese A-series / X-series curves |
| Cache | STREAM Triad | STREAM | ≥120 GB/s stretch / ≥70 baseline | Apple A19 Pro 75.8 GB/s, D9500 ~85 GB/s |
| Cache | LLC IPC vs LRU | ChampSim + DPC-3 | +12-15% (Mockingjay) | published Mockingjay/Hawkeye numbers |
| Memory | lmbench bw_mem sustained | lmbench | ≥120 GB/s | S8E5 84.8, D9500 ~85 |
| Memory | Pointer-chase p95 latency | lmbench | ≤120 ns | Apple A-series ~115 ns |
| Memory | UFS+DRAM contention | fio + STREAM | UFS BW degrade ≤15% under DRAM saturation | UFS 4.1 |
| Memory | dma-buf stale-buffer negative | bespoke | must fault or be statically forbidden | required by uma-coherency-validation-strategy |
| Memory | IOMMU fault injection | bespoke | fault entry has master / IOVA / access / syndrome | RISC-V IOMMU spec |
| NPU | MLPerf Mobile / Tiny | LiteRT + ExecuTorch | per npu-2028-target.yaml | Snapdragon 8 Elite Gen 5 / Dimensity 9500 MLPerf |
| NPU | unsupported_ops / cpu_fallback_pct | per-model report | ≤1% fallback | published vendor model lists |
| Concurrent | Display underflow at 120 Hz QHD + NPU + camera + CPU | bespoke | 0 underflow events | this is the SLC-pays-its-rent test |
| Process | DTCO sensitivity | OpenLane × {Sky130, GF180, IHP SG13, ASAP7} | clean DRC/LVS, timing-clean at design freq for each PDK | Basilisk at 130 nm 77 MHz |
| Process | Multi-corner STA | OpenSTA / PrimeTime | 16-32 dominant corners, ML pruning from 100-200 | N3/N2 industry practice |
| Power | Activity-traced power | PrimePower / Voltus + workload VCDs | matches modeled operating point | competitor sustained scores (independent reviews) |
| Power | Static + dynamic IR-drop | Voltus / RedHawk-SC | <5% nominal static / <10% dynamic | foundry sign-off |
| Power | Sustained perf vs Tj | 30-min real-workload runs | sustained GB6 MT ≥8500 | A19 Pro ~8500, S8E5 ~9200, D9500 ~8400 |
| Compiler | LLVM nightly llvm-test-suite | LLVM CI | no regressions | LLVM trunk |
| Compiler | AutoFDO + Propeller + BOLT uplift | full build matrix | 5-10% net (after Spectre) | Google AutoFDO+Propeller 10% kernel |
| Compiler | RVV autovec quality | Igalia-style 16-50 kernel suite | ≥ LLVM-stock geomean | GCC 15 / SVE2 reference |
| Android | Cold-start ms (Chrome, YouTube, Maps) | perfetto / systrace | ≤900 ms Chrome cold | D9500-class flagship |
| Security | Zicfilp + Zicfiss enabled | runtime check | on by default | matches BTI/SCS in Android |
elizanpu dialect lowering linalg.matmul/conv/attention/softmax/layer-norm to descriptor ring — replaces Python smoke.docs/arch/branch-prediction.md contract + eliza-kunminghu-manifest.json pinning open BPU IP commit hash.pd/openlane/config.<node>.json per target (sky130, gf180, ihp-sg13, asap7-predictive, n2p-stub, a14-stub).soc-optimized-operating-point.yaml and uma-dram-evidence-gate.yaml — split LPDDR5X baseline / LPDDR6 stretch SKUs.P0 (Q3-Q4 2026): repo discipline + decisions that cannot slip.
soc-optimized-operating-point.yaml vs uma-dram-evidence-gate.yaml; split LPDDR5X / LPDDR6 SKUs.mobile-sota-2026.yaml from non-goal to procurement requirement with vendor shortlist.docs/arch/branch-prediction.md contract.P1 (Q1-Q2 2027): buildout that does not need an advanced-node PDK.
elizanpu dialect; ExecuTorch RISC-V backend prototype.P2 (Q3 2027 — Q2 2028): advanced-node closure.
P3 (Q3-Q4 2028): tapeout, dev-board, bring-up.
Per-domain research artifacts (full sources, tables, citations):