OoO Execution SOTA — 2028 RISC-V Phone-Class AP

packages/chip/docs/architecture-optimization/sota-2028/ooo-execution.md

Sub-report of 2028-sota-integrated-report.md.

A. SOTA Snapshot — Comparative uarch table (2024-2026 flagships)

Numbers are public/disclosed where vendors provided them and reverse-engineered (chipsandcheese, Dougall Johnson, WikiChip Fuse, jia.je/cpu) elsewhere. "FE" = front-end.

| Core | ISA | FE fetch / decode | Dispatch / Retire | ROB / in-flight | PRF (INT / FP) | Int ALU | AGU / LSU | FP-Vec | Vector | Sched | L1I / L1D / L2 / SLC | Process | F_max | IPC SPEC2017 int | GB6 ST | Area / core (mm²) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Apple A19 Pro "Tahiti" (P) | ARMv9.5-A | 16B, ~9-wide | ~10-wide | ~700 ROB-eq | ~432 / ~432 (est) | 6+ | 4 | 4 | 128b NEON+SME2 | clustered ~6 | 192/128/16 MB / 8 MB SLC | TSMC N3P | 4.26 GHz | ~11-12 est | 3895 | ~2.0 |
| Qualcomm Oryon Gen 3 Prime | ARMv9.2-A | 16B, 9-wide | 9-wide | ~650 ROB | ~400 / ~400 | 6 (incl 2 br) | 3+ | 4 (256b VLEN agg) | 128b NEON | unified+dist | 192/96/12 MB / 16 MB SLC | TSMC N3P | 4.74 GHz | ~10.5 | 3649 | ~2.1 |
| Arm Cortex-X925 "Blackhawk" | ARMv9.2-A | 32B, 10-wide | 10-wide | ~525 effective (768 inflight + 1536 fused) | ~448 total | 8 ALU 4×28-sched, 3 br | 4 AGU (2 store) | 6 FP/SIMD (3×~53) | 128b NEON+SVE2 | dist 4 INT + 3 FP | 64/64/2-3 MB | TSMC N3E | 3.62-3.8 GHz | ~11.8 | 3000-3400 | ~1.7 |
| Arm C1-Ultra "Travis" | ARMv9.3-A | 32B, 10-wide refined | 10-wide | ~2000 in-flight | ~512+ | 8 ALU | 4 AGU | 6 FP/SIMD+SME2 | 128b NEON+SVE2+SME2 | dist + SME sched | 64/64/3 MB / 16 MB SLC | TSMC N3P | 4.21 GHz | ~13.2 (Arm +25% vs X925) | 3502 | ~1.9 |
| AMD Zen 5 | x86-64 | 32B, dual 4-wide (=8) | 8-wide | 448 ROB | 240 INT / 384 FP (512b) | 6 ALU unified | 4 AGU | 4 FP (2× decoupled) + 1 store/FP-to-INT | 512b AVX-512 | unified INT, decoupled FP | 32/48/1 MB / 32 MB L3 | TSMC N4P | 5.7 GHz | ~12 | 3400 (9950X) | ~3.6 (incl L2) |
| Intel Lion Cove (Lunar Lake) | x86-64 | 32B, 8-wide (µop 12-wide) | 8-wide, HT removed | ~576 ROB | INT/FP/VEC files deeper | 6 ALU | 4 AGU (3 ld + 2 st overlap) | 4 vector (2×128 + 2×256) | 256b AVX2 (no AVX-512 on LL) | split 144 INT (6 ports) + 96 vector (5 ports) | 64/48/2.5-3 MB / 8 MB | TSMC N3B | 5.1 GHz | ~12.5 | 2900-3100 | ~4.5 |
| Tenstorrent Ascalon-D8 | RVA23 | 16B, 8-wide | 8-wide | ~450-500 ROB (est) | ~400 / ~400 | 6 INT (2 br) | 3 LSU | 2 FP + 2×256-bit RVV | 256b RVV 1.0 | ~6-queue dist | 64/64/1-2 MB | N3/N5 IP | ~3.2 GHz target | ~21 SPECint2006/GHz ≈ 9-10 IPC SPEC2017 | n/a | ~1.8 IP |
| SiFive P870 | RVA23 (V) | 16B, 6-wide | 6-wide | ~1120 inflight | ~256 / 128 V renames | 5 INT + 1 br-cap | 2-3 LSU | 2 FP/V w/ V sequencer | 128b RVV 1.0 | 4-queue dist | 64/64/1 MB | N7/N5 IP | ~3.0 GHz | ~13.5 SPECint2006/GHz ≈ 6.5 IPC SPEC2017 | n/a | ~1.2 IP |
| XiangShan Kunminghu V3 | RV64GC + V | 16B, 6-wide | 6-wide | ~256-320 ROB | ~192 / ~192 | 4 ALU + 2 br | 2-3 LSU | 2 FP + 2 V | 128b RVV 1.0 | dist | 64/64/1 MB | 7/12/28 nm | ~3.0 GHz sim | >15 pt/GHz target → 20 (~7-9 IPC SPEC2017) | Neoverse-N2 -8% claim | ~1.5 (12 nm) |
| Ventana Veyron V2 | RV64GC+V | 16B, 8-wide (fusion magnifies) | up to 15 internal ops/clock with fusion | ~480+ ROB | ~384 / 256 | 6+ INT | 4 LSU | 2-4 FP/V | 256-512b RVV config | dist | 64/64/2 MB | TSMC N4/N3 | ~3.6 GHz | ~17 SPECint2006/GHz ≈ 8 IPC SPEC2017 | n/a (datacenter) | ~2.2 |
| MIPS P8700 | RV64GC | 8B fetch, 4-wide | 4-wide | ~96 ROB | smaller | 4 ALU | 2 LSU | 2 FP | 128b | dist | 64/64/256 KB-2 MB | 7/16 nm | ~2.5 GHz | ~6 IPC SPEC2017 est | n/a | ~0.9 |
| BOOM v3 / SonicBOOM | RV64GC | 4-8 wide param | 2-4-wide typical | 32-256 ROB param | 64-128 / 64-128 | 1-4 ALU | 1-2 LSU | 1-2 FP | optional | dist | param | acad/FPGA | ~1.5-2.5 GHz ASIC est | 6.2 CoreMark/MHz → 3-5 IPC SPEC2017 | n/a | param |
| AMD Strix Point Zen 5c | x86-64 | 32B, 8-wide | 8-wide | 448 ROB | 240/384 | 6 | 4 | 4 FP | 256b AVX-512 native | unified | 32/48/1 MB / 24 MB | TSMC N4P | 5.1 GHz | ~11.5 | 2900 | ~2.8 |

Reference rows: D9500 (C1-Ultra 4.21 / Premium 3.5 / Pro 2.7 GHz), S8 Elite Gen 5 (Oryon Gen 3 Prime 4.6-4.74 / Perf 3.62 GHz), Tensor G5 (X4 3.78 / A725 3.05 / A520 2.25 GHz), Exynos 2600 (Samsung 2 nm GAA, GB6 ST 3197). Source: docs/spec-db/mobile-sota-2026.yaml.

B. Current state in packages/chip

  • rtl/cpu/e1_cpu_subsystem_stub.sv: tiny in-order RV64 fetch/execute stub with a 32-bit AXI-Lite manager and 32 architectural registers of 64 bits each. Supports JAL/JALR/BEQ/BNE/LUI/AUIPC/ADDI/ADD/SUB/LW/SW; halts on ECALL, EBREAK, an illegal instruction, or an AXI error. Not Linux-capable: no CSRs, privilege modes, MMU, traps, atomics, compressed, float, or vector support.
  • rtl/cpu/e1_cva6_wrapper.sv: drop-in wrapper for OpenHW CVA6 (ArianeDefaultConfig, RV64IMAFDC + S-mode + Sv39), guarded by +define+E1_HAVE_CVA6. CVA6 is a 6-stage, single-issue, in-order core with limited speculation. Closest commercial peer: Cortex-A55-class. Expected SPEC2017 int IPC: ~1.5-1.8 on RTL, roughly 7-8× behind the ~11.8 listed for Cortex-X925.
  • Chipyard Rocket (selected per eliza-rocket-manifest.json, commit 69eba860): 5-stage in-order, single-issue, RV64GC. SPEC2017 int IPC ~1.0-1.5.
  • Modeled CPU planning point (benchmarks/results/simulator-arch-metrics-sota.json): 2-core, 3.8 GHz, modeled IPC 2.42, 2.76 W package. Architecture target only — cpu_ap_evidence_blocked.
  • All flagship-class claims fail-closed blocked until make cpu-ap-completion-gate.

Gap to A19 Pro / Oryon Gen 3 / C1-Ultra:

  • Decode width: 1 → 10 (10×)
  • ROB: none (in-order, so no ratio applies) → ~600-2000 entries
  • SPEC2017 IPC: ~1.0 → ~12 (~12×)
  • ISA: RV64I subset → RVA23 + V + matrix
  • Memory ordering: none → RVWMO+Ztso
  • DVFS, big.LITTLE: none → 1+3+4

Big core ("e1-ultra"), 1 instance per cluster

| Parameter | Target | Rationale |
| --- | --- | --- |
| ISA | RVA23 + V (RVV 1.0 VLEN=256) + Zfh/Zvfh + Zvbb/Zvkt + Zicboz/Zicbom + Ztso + Sv57 + Smaia (AIA) + Zihintpause + Zicond + Zama (matrix) | RVA23 mandated for Android RISC-V ABI; Ztso to run translated x86/ARM with single fence cost; Sv57 future-proofs >1 TB virtual; Ubuntu 25.10 mandates RVA23 |
| Front-end fetch | 32 B / cycle | RVC means up to 16 inst in 32 B; supports 10-wide decode |
| Decode | 8-wide native + 2 fused = 10 effective | Match X925/C1-Ultra; macro-op fusion recovers density loss vs ARM |
| L0 µop cache | 3 K entries, 12-wide read | Apple/Lion Cove style; bypasses decode for hot loops |
| Branch predictor | 16K-entry L1 BTB, 64K L2 BTB, TAGE-SC-L, 32-entry RAS, ITTAGE | Matches X925's 16K/2048+L2. Mispredict <14 cyc |
| Dispatch / Rename / Retire | 8 / 8 | PRF-based renaming (not ROB-based) for energy |
| ROB | 512 entries (~700 effective with fusion expansion) | Between X925 effective (~525) and Apple (~700) |
| PRF | 400 INT (64-bit) / 400 FP+V (256-bit) | Apple-class. Vector reg width 256b matches RVV DLEN=256 |
| Schedulers | Distributed: 4×32 INT, 2×48 FP/V, 2×40 LSU | Match X925 four-cluster; energy << unified at this width |
| Execution ports | 6 ALU (2 br), 2 IMUL/IDIV shared on ALU0/3, 4 FP/V, 4 AGU (2 ld + 2 st + 2 dual) | Match X925 (8 ALU, 4 AGU) but trim 6 ALU + 4 AGU |
| Load / Store queue | 192 LQ / 128 SQ | Apple-class; Oryon ~150/100. Store-set predictor for memory-disamb |
| Store-to-load forwarding | 4 simultaneous, partial-overlap | X925 explicitly improved; Zen 5 perfect-store forwarding |
| Vector unit | 2× 256-bit RVV 1.0 (DLEN=256, ELEN=64), Zvbb/Zvfh/Zvkt; future SME-like matrix tile via Zama | Matches Ascalon (2× 256b). X925 is 4× 128b NEON; equal effective BW |
| Matrix | Reserve area for RISC-V matrix when ratified; meanwhile expose via Zvqdotq/Zvfh and CSR-mapped tile regs | A19 Pro / C1-Ultra bet on SME2 INT8/BF16 tile units |
| Memory ordering | RVWMO native + Ztso mode selectable per-page (PTE bit) or per-thread | Ztso lets x86 binaries run without fence-spam, ~5-15% perf for translated |
| MMU | Sv48 default, Sv57 enabled; ASID 16-bit; 64 L1 ITLB + 64 L1 DTLB + 2048 L2 unified | Matches X925 (96/2048). Required for >32 GB phone RAM |
| Caches | 64 KB L1I (4-way) + 64 KB L1D (4-way, 4-cycle), private 1 MB L2 (12-cycle), shared 8 MB L3, 16 MB SLC | Matches X925 / D9500 |
| Clock | 4.0-4.3 GHz typical, 4.5 GHz burst | Below A19 Pro 4.26 to guard RVWMO+Ztso fence costs |
| IPC SPEC2017 target | ≥9 (≥ 22 SPECint2006/GHz, beating Ascalon's 21) | Achievable per Veyron V2 |
| Area (3 nm class) | ~1.8-2.0 mm² incl L2 | Tracks X925 (~1.7) |
| Power | 2.8 W burst, 1.4 W sustained | Matches soc-optimized-operating-point.yaml |
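
The IPC target above is stated both as SPEC2017 IPC (≥9) and as SPECint2006/GHz (≥22). As a rough cross-check, the sketch below derives the conversion factor implied by the three comparative-table rows that quote both figures (Ascalon-D8, P870, Veyron V2) and applies it to the target. The 9.5 midpoint for Ascalon's "9-10" range and the helper names are assumptions made for illustration; the resulting ratio reflects this report's own estimates, not a measured constant.

```python
# (SPECint2006 per GHz, estimated SPEC2017 int IPC) as quoted in the comparative table.
# Ascalon's IPC is given as a 9-10 range; 9.5 is an assumed midpoint.
QUOTED = {
    "Ascalon-D8": (21.0, 9.5),
    "SiFive P870": (13.5, 6.5),
    "Veyron V2": (17.0, 8.0),
}

def implied_ratio(quoted):
    """Mean of IPC / (SPECint2006 per GHz) over the quoted rows."""
    ratios = [ipc / pts for pts, ipc in quoted.values()]
    return sum(ratios) / len(ratios)

def implied_ipc(pts_per_ghz, ratio):
    """Convert a SPECint2006/GHz figure to an IPC estimate with the given ratio."""
    return pts_per_ghz * ratio

ratio = implied_ratio(QUOTED)
print(f"implied IPC per (SPECint2006/GHz): {ratio:.3f}")
print(f"22 pt/GHz -> IPC ~{implied_ipc(22.0, ratio):.1f}")
print(f"IPC 9     -> ~{9.0 / ratio:.1f} pt/GHz")
```

Under these assumptions 22 pt/GHz maps to an IPC a little above 10, while IPC 9 maps to roughly 19 pt/GHz, so the two forms of the target are not quite equivalent and it may be worth stating which one is the binding gate.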

Mid core ("e1-premium"), 3 instances

  • Base: fork of XiangShan Kunminghu V3 (open Mulan PSL v2)
  • 6-wide decode, 6 dispatch, 256-entry ROB, 192/192 PRF
  • 1× 128-bit RVV 1.0
  • RVA23, no matrix unit, RVWMO only
  • 32 KB L1I / 32 KB L1D / 512 KB private L2
  • 3.0-3.4 GHz
  • IPC target ~5 (Kunminghu V3 is quoted at 15 pt/GHz today, with a 20 pt/GHz target)
  • Area ~0.7-0.9 mm²

Little core ("e1-pro"), 4 instances

  • Base: CVA6 (OpenHW, Solderpad) or Chipyard Rocket — pick CVA6 (RV64GC+S-mode Sv39 in-tree)
  • 6-stage in-order, single-issue
  • Optional 2-way superscalar variant
  • 32 KB L1I / 32 KB L1D / shared cluster L2 (256 KB)
  • 1.8-2.2 GHz
  • IPC ~1.6
  • Area ~0.25 mm² each

Cluster topology (matches D9500 / Apple 2+4)

  • 1× e1-ultra + 3× e1-premium + 4× e1-pro
  • DSU-110-equivalent "e1-coherent-bus" with 16 MB SLC, MESI-class snoop filter, CHI-like protocol
  • Per-core power gating, per-cluster DVFS, retention voltage for L1
  • Management hart: separate small Ibex (lowRISC Apache-2.0) for boot/security/PMU
  • Total CPU area budget (3 nm class): ~7 mm² (1×2.0 + 3×0.9 + 4×0.25 + DSU+SLC overhead)
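
The area budget in the last bullet can be checked with a few lines of arithmetic. This is a minimal sketch that uses the per-core figures exactly as they appear in the parenthesised sum (2.0, 0.9, and 0.25 mm²); the dictionary layout and names are illustrative only.

```python
# Per-core area figures (mm^2, 3 nm class) taken from the budget bullet above.
CORES = {
    "e1-ultra": (1, 2.0),     # (instance count, area per instance)
    "e1-premium": (3, 0.9),
    "e1-pro": (4, 0.25),
}
TOTAL_BUDGET_MM2 = 7.0

def core_area(cores):
    """Sum of count * per-instance area over all core types."""
    return sum(count * area for count, area in cores.values())

cores_mm2 = core_area(CORES)
uncore_mm2 = TOTAL_BUDGET_MM2 - cores_mm2
print(f"cores: {cores_mm2:.2f} mm^2, left for DSU+SLC overhead: {uncore_mm2:.2f} mm^2")
```

The cores come to about 5.7 mm², which leaves about 1.3 mm² for the coherent bus, snoop filter, and whatever share of the L3/SLC is charged to the CPU budget. The report does not say how much of the SLC array is counted here, so that split remains an open assumption.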

Open-source path recommendation (ranked)

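  1. Fork Tenstorrent Ascalon-D8: 8-wide OoO RVA23 + 256-bit RVV, LLVM upstream support (Jim Keller / Apple-Tesla-AMD-Arm alumni). Listed here as Apache-licensed IP, but section F treats the RTL as commercial IP with unpublished mobile terms, so the license must be confirmed before this counts as an open path. Highest-confidence flagship-class numbers if licensing terms work for mobile SKU. Primary path.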
  2. Fork XiangShan Kunminghu V3 — Mulan PSL v2 (Apache-compatible-ish), 6-wide, "8% behind Neoverse N2", scalable 7/12/28 nm. Best fully-open. Mid core.
  3. SonicBOOM (BOOMv3) — UCB BSD, 2-4-wide configurable, 6.2 CoreMarks/MHz. Academic exploration / mid-core fallback.
  4. SiFive P870 — commercial license; macro-op fusion; not open RTL.
  5. Veyron V2 — datacenter-focused; not licensable for mobile.

D. Benchmarks / Eval / Testing

Required plan extending cpu-npu-2028-readiness-scorecard.yaml lines 115-125:

| Benchmark | Metric | 2028 target (big core) | 2026 reference |
| --- | --- | --- | --- |
| SPEC CPU2017 int rate | per-core | ≥9 | X925 ~11.8, A19 ~12 |
| SPEC CPU2017 int speed | speed | ≥7 | X925 ~8, Zen 5 ~9 |
| SPEC CPU2017 fp rate | rate | ≥7 | X925 ~10, Zen 5 ~14 |
| Geekbench 6 ST | score | ≥2800 | A19 3895, Oryon Gen 3 3649, C1-Ultra 3502 |
| Geekbench 6 MT | score | ≥8500 (8-core) | A19 9988, S8EG5 11068, D9500 10417 |
| CoreMark/MHz | rate | ≥10 | BOOM 6.2, X925 ~13, Zen 5 ~12.5 |
| CoreMark-Pro | composite | ≥25k | enterprise/mobile parity gate |
| Embench-IoT | composite | full pass | small-program code-density gate |
| JetStream 2 (V8) | composite | ≥250 | Android/Chrome reality check |
| Octane 2.0 | score | ≥70k | legacy JS reality check |
| SPECjbb 2015 max-jOPS | jOPS | ≥40k | server-class java; CHI/coherency stress |
| STREAM Triad | GB/s | ≥150 | per mobile-sota-2026.yaml LPDDR5X |
| lmbench lat_mem_rd | ns at 1 GB stride | ≤120 ns | TLB+DRAM latency floor |
| lmbench bw_mem | GB/s | ≥120 | sustained memcpy |
| fio randread 4k QD32 | IOPS | ≥800k | UFS 4.1 path |
| systrace / perfetto cold-start | ms | ≤900 ms Chrome cold | D9500-class flagship |
| MLPerf Mobile (CPU fallback) | mobilenet-v3 / bert-mobile | within 2× NPU latency | "NPU real, CPU fallback acceptable" |
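
One way the targets above could be turned into a machine-checkable gate is sketched below. It covers only a subset of the rows, and every identifier in it (the `Gate` class, the metric keys, `evaluate`) is hypothetical rather than something that exists in packages/chip. Missing measurements fail closed, in the spirit of the evidence gates described in section B.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gate:
    name: str
    threshold: float
    higher_is_better: bool = True

# A subset of the big-core targets from the benchmark table (metric keys are made up).
GATES = [
    Gate("spec2017_int_rate_per_core", 9.0),
    Gate("geekbench6_st", 2800.0),
    Gate("geekbench6_mt", 8500.0),
    Gate("coremark_per_mhz", 10.0),
    Gate("jetstream2", 250.0),
    Gate("stream_triad_gbps", 150.0),
    Gate("lat_mem_rd_ns", 120.0, higher_is_better=False),
]

def evaluate(measured):
    """Return {gate name: bool}; a missing measurement counts as a failure."""
    results = {}
    for gate in GATES:
        value = measured.get(gate.name)
        if value is None:
            results[gate.name] = False          # fail closed: no evidence, no pass
        elif gate.higher_is_better:
            results[gate.name] = value >= gate.threshold
        else:
            results[gate.name] = value <= gate.threshold
    return results

if __name__ == "__main__":
    # Placeholder numbers for illustration, not measurements.
    sample = {"geekbench6_st": 2900.0, "lat_mem_rd_ns": 135.0}
    for name, ok in evaluate(sample).items():
        print(f"{name}: {'PASS' if ok else 'BLOCKED'}")
```

Treating an absent result as a failure keeps a modeled or partial run from being read as flagship evidence.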

Required infrastructure:

  • Cycle-accurate Verilator with FST waves, hooked to gem5 (XS-GEM5 fork claims >95% SPECCPU 2006 correlation, openxiangshan/GEM5).
  • FireSim FPGA-accelerated full-system on AWS F1 / F2: boot Linux + SPEC at 100+ MHz.
  • DiplomatTracer or equivalent for AMBA CHI traces.
  • LLVM compiler perf regression with llvm-test-suite.
  • Android RVA23 Cuttlefish boot (gated as aosp_simulator_evidence_blocked).
  • Power: powertop --html, iio:device* rails on board.
  • Thermal: ARM perfetto thermal_zone trace correlated with cpu-npu-2028-burst-thermal-transient.json.

E. Optimizations: has / should / needs

Has

  • AXI-Lite contract scaffold + CVA6 wrapper boundary.
  • Modeled CPU+NPU operating point (IPC 1.8 base / 2.42 SOTA, 3.2-3.8 GHz, 1.4 W).
  • Process-14a derate model.
  • Fail-closed evidence gates.
  • Modeled benchmark harness passing npu_arch_sim_open_2028 / cpu_arch_sim_sota_2028.

Should (medium-term, Linux smoke)

  • Real Chipyard Rocket integration.
  • Full RV64GC + S-mode + Sv39 + CLINT + PLIC + UART boot.
  • TileLink-AXI bridge with 64-bit data, atomics, MOESI snoop.
  • OpenSBI + U-Boot + Linux 6.x + minimal Android container.

Definitely needs (2028 flagship)

  1. TAGE-SC-L with 16K L1 BTB and 64K L2 BTB; ITTAGE for indirect; RAS for calls.
  2. PRF-based register renaming (energy < ROB-based at 400+ regs).
  3. Distributed schedulers, capture-rename — X925 4×28 beats unified.
  4. Store-set predictor for memory disambiguation.
  5. RVWMO with optional Ztso PTE-bit — enables Box64-like x86 emulation without 4-12% fence-tax.
  6. Macro-op fusion: lui+addi, lui+ld, slli+add, auipc+jalr, addi+bne. RISC-V fusion ~5.4% effective inst reduction.
  7. L0 µop cache (3-4 K, 12-wide read).
  8. Decoupled fetch — branch-predict-ahead queue feeds 4-deep fetch to L1I.
  9. 2-cycle conditional branch predictor + 1-cycle BTB hot path — Oryon Gen 3 specifically improved.
  10. Aggressive clock gating on FP/V; SVE2/SME-equivalent shutdown for INT-only.
  11. Per-core power gating <50 µs wake; retention voltage on L1.
  12. Spectre/Meltdown mitigations: invisible speculation (InvisiSpec) and DAWG-like cache partitioning; expect a 3-7% IPC cost if architected from day 1, 10-20% if bolted on.
  13. Hardware page-table walker with 4-port concurrent walks.
  14. Coherent IOMMU/SMMU for NPU and GPU sharing CPU virtual address.
  15. CHI-like coherent bus with snoop filter (MOESI or MESIF).
  16. Cluster shared L3 + SoC SLC two-level (matches D9500 16 + 10).
  17. Hardware prefetchers: stream + stride + region + entanglement at L2; matches A18+.
  18. DVFS with workload-aware governors, EAS-style with Android scheduler — Tensor G5 underperforms partly due to weak DVFS, not weak silicon.
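
Item 6 names five fusion pairs. The sketch below shows how a decoder-side check for those pairs might look on a toy instruction stream. The `Inst` record and the register-matching rule are simplifying assumptions for illustration; a real decoder would also check immediates, alignment within the fetch group, and exception behaviour.

```python
from typing import NamedTuple, Optional

class Inst(NamedTuple):
    op: str
    rd: Optional[int]            # destination register number, None for branches
    rs1: Optional[int] = None
    rs2: Optional[int] = None

# The five pairs named in item 6.
FUSIBLE_PAIRS = {
    ("lui", "addi"), ("lui", "ld"), ("slli", "add"),
    ("auipc", "jalr"), ("addi", "bne"),
}

def can_fuse(first: Inst, second: Inst) -> bool:
    """Conservative rule (an assumption): the second op must read the first
    op's result, and unless it is a branch it must also overwrite that
    register, so the intermediate value never has to be visible on its own.
    For addi+bne the fused op still writes the addi result."""
    if (first.op, second.op) not in FUSIBLE_PAIRS:
        return False
    if first.rd is None or first.rd == 0:    # writes to x0 are discarded
        return False
    reads_result = first.rd in (second.rs1, second.rs2)
    if second.op == "bne":
        return reads_result
    return reads_result and second.rd == first.rd

def fuse(stream):
    """Greedy, non-overlapping pairing in program order."""
    ops, i = [], 0
    while i < len(stream):
        if i + 1 < len(stream) and can_fuse(stream[i], stream[i + 1]):
            ops.append((stream[i], stream[i + 1]))
            i += 2
        else:
            ops.append((stream[i],))
            i += 1
    return ops

stream = [
    Inst("lui", 5), Inst("addi", 5, 5),            # fuses
    Inst("slli", 6, 7), Inst("add", 6, 6, 8),      # fuses
    Inst("lui", 5), Inst("addi", 6, 5),            # different rd: left unfused
    Inst("addi", 9, 9), Inst("bne", None, 9, 0),   # fuses
]
ops = fuse(stream)
print(f"{len(stream)} instructions -> {len(ops)} internal ops")
```

The toy stream is deliberately dense in fusible pairs, so its reduction says nothing about real code; the ~5.4% figure quoted in item 6 is the relevant expectation.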

F. Risks and open questions

Licensing

  • Ascalon-D8: Tenstorrent IP licensing for mobile SKUs not published. LLVM merges upstream, but RTL is commercial IP. Mobile-volume per-unit royalty may be equivalent to Cortex-X licensing.
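  • Kunminghu V3: Mulan PSL v2 (Mulan Permissive Software License) is a permissive license with a patent grant, not GPL-style copyleft. Its patent and attribution terms should still be reviewed before the RTL is combined with a proprietary Android BSP. Cooley LLP / RISC-V International legal review required.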
  • CVA6: Solderpad / Apache-2.0; integration risk low, but in-order so 2028 flagship impossible from CVA6 alone.
  • BOOM: BSD; cleanest legally, but research-grade RTL.

RTL maturity

  • Tenstorrent Ascalon silicon-proven. ✓
  • XiangShan Kunminghu V3 taped out at academic node (28 nm), not flagship 3 nm. Verification gap to 3 nm large.
  • BOOM has FPGA evidence (Zynq, AWS F1) but no commercial ASIC tapeout.
  • All open OoO RISC-V cores lack validated PPA on N3P/N3B.

Verification debt

  • Flagship-class OoO needs: 10⁹+ random instruction tests (RISCV-DV, Imperas riscv-tests), formal proof of memory ordering, Spectre/MDS gadget audits, 4-week+ AVP nightly regressions, deadlock liveness for coherent fabric, ECC injection.
  • Open RISC-V has nothing comparable to Arm AVS. Commission RISC-V Compliance Test Suite extensions for V, Ztso, and the matrix extension.
  • Formal-vs-Linux-boot gap: core can pass formal RVA23 conformance and hang Linux at userspace TLB shootdown (real bug in early Cortex-A57). Need targeted Linux kernel race-condition stress (LKDTM, LTP, stress-ng) on FPGA before tapeout.

Software ecosystem

  • RVV 1.0 compiler: GCC 14+/LLVM 19+ usable but auto-vectorizer ~30-50% behind ARM NEON. Hand-written intrinsics for hotspots (memcpy, libc, ffmpeg, OpenSSL) need explicit funding.
  • Android RISC-V: Google paused 2024, restarted limited support contingent on RVA23. Mainline AOSP RVA23 expected late 2026/2027. Binary-app ecosystem (Snapchat, TikTok, banking) lags 2-3 years.
  • Matrix extension (Zama, RVM): not ratified mid-2026. Lock-in risk to path that doesn't ratify.
  • JIT performance: V8/Hermes/ART JIT on RISC-V ~70% of NEON-on-ARM per JetStream2. Significant work for Android usability.

Hardware design open questions

  • PTE-bit-driven Ztso: only Veyron has implemented per-page TSO toggles. Need ISA WG buy-in.
  • Cluster heterogeneity: with 1+3+4, does cluster shared L3 hang off DSU-like uncore or split? D9500 splits SLC. Apple unifies.
  • NPU coherency: docs/spec-db/npu-2028-target.yaml requires cache_coherent_cpu_submission. CHI-coherent NPUs rare; Tensor G5 TPU non-coherent.
  • Vector vs Matrix: 2×256-bit RVV vs 1×512-bit + matrix tile is same arithmetic, vastly different software cost. Matrix wins for transformer prefill; RVV wins for image/audio/general.
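
The "same arithmetic" point in the last bullet can be made concrete by counting peak element lanes per cycle. The sketch assumes every pipe is fully pipelined and issues one vector operation per cycle, and it leaves out any matrix tile, which would add throughput beyond these figures. The function name is illustrative.

```python
def lanes_per_cycle(units: int, width_bits: int, elem_bits: int) -> int:
    """Peak element operations per cycle for `units` pipes of `width_bits` each."""
    return units * (width_bits // elem_bits)

CONFIGS = {
    "2x256-bit RVV": (2, 256),
    "1x512-bit RVV": (1, 512),
    "4x128-bit NEON (X925 figure used in this report)": (4, 128),
}

for name, (units, width) in CONFIGS.items():
    row = {f"{e}-bit": lanes_per_cycle(units, width, e) for e in (8, 16, 32, 64)}
    print(name, row)
```

All three configurations give the same lane counts at every element width, so the choice between them rests on scheduling, register-file cost, and software effort rather than on peak vector throughput.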

Schedule risk

  • 2028 tapeout from 2026-05 requires: core selection by Q4 2026, RTL freeze Q4 2027, tapeout 2028H1, sample silicon 2028H2.
  • Open RISC-V has slipped 2-3× on similar schedules (BOOM, XiangShan).
  • Conservative: target 2029 phone product with 2028 dev-board silicon, 12 months for Android compatibility and CTS/VTS.

Summary recommendation

  1. Big core: 8-wide decode, ~512 ROB, ~400+400 PRF, 6 ALU + 4 AGU + 4 FP/V (2× 256-bit RVV), distributed schedulers, RVA23+V+Ztso+Sv57, 4.0-4.3 GHz, ~2 mm² 3 nm, IPC ≥9 SPEC2017int. Primary path: fork Tenstorrent Ascalon-D8. Fallback: scale up XiangShan Kunminghu V3 to 8-wide.
  2. Mid core: XiangShan Kunminghu V3 6-wide, ~256 ROB, 3.2 GHz, ~0.85 mm². 3 instances.
  3. Little core: CVA6 6-stage in-order, 2.0 GHz, ~0.25 mm². 4 instances.
  4. Topology: 1+3+4 matching D9500 / Apple; CHI-coherent DSU with 8 MB L3 + 16 MB SLC; Ibex mgmt hart.
  5. Memory ordering: RVWMO native + Ztso (per-page PTE bit) for x86/ARM binary translation.
  6. Benchmark gates: GB6 ST ≥2800, MT ≥8500, SPEC2017 int rate/core ≥9, JetStream2 ≥250.
  7. Honest 2028 stretch: parity with C1-Ultra GB6 3502 / Oryon Gen 3 3649; A19 Pro 3895 is aspirational.
  8. Schedule: realistic phone-silicon target slips to 2029 product (2028 dev-board).
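
To put items 6 and 7 on a common footing, the sketch below normalises the Geekbench 6 single-thread scores quoted in the comparative table by clock and compares them with what e1-ultra would need at its 4.0-4.3 GHz typical range. GB6 does not scale purely with frequency (memory latency and bandwidth matter), so this is a first-order view only; the names are illustrative.

```python
# (GB6 single-thread score, peak clock in GHz) as quoted in the comparative table.
REFERENCE = {
    "A19 Pro": (3895, 4.26),
    "Oryon Gen 3 Prime": (3649, 4.74),
    "C1-Ultra": (3502, 4.21),
}
TARGET_SCORE = 2800
E1_ULTRA_CLOCKS_GHZ = (4.0, 4.3)    # typical range from the big-core table

def points_per_ghz(score: float, ghz: float) -> float:
    return score / ghz

for name, (score, ghz) in REFERENCE.items():
    print(f"{name}: {points_per_ghz(score, ghz):.0f} pts/GHz")

for ghz in E1_ULTRA_CLOCKS_GHZ:
    gate = points_per_ghz(TARGET_SCORE, ghz)
    parity = points_per_ghz(REFERENCE["C1-Ultra"][0], ghz)
    print(f"e1-ultra @ {ghz} GHz: gate needs {gate:.0f}, "
          f"C1-Ultra parity needs {parity:.0f} pts/GHz")
```

On these numbers the ≥2800 gate needs roughly 650-700 pts/GHz, below all three reference cores, while C1-Ultra parity needs roughly 815-875 pts/GHz, which is about what C1-Ultra itself delivers per clock. That is consistent with calling parity a stretch and A19 Pro aspirational.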

Sources