packages/chip/docs/architecture-optimization/sota-2028/ooo-execution.md
Sub-report of 2028-sota-integrated-report.md.
Numbers are public/disclosed where vendors provided them and reverse-engineered (chipsandcheese, Dougall Johnson, WikiChip Fuse, jia.je/cpu) elsewhere. "FE" = front-end.
| Core | ISA | FE fetch / decode | Dispatch / Retire | ROB / in-flight | PRF (INT / FP) | Int ALU | AGU / LSU | FP/Vec pipes | Vector width | Scheduler | L1I / L1D (KB) / L2 / SLC or L3 | Process | F_max | IPC SPEC2017 int | GB6 ST | Area / core (mm²) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apple A19 Pro "Tahiti" (P) | ARMv9.5-A | 16B, ~9-wide | ~10-wide | ~700 ROB-eq | ~432 / ~432 (est) | 6+ | 4 | 4 | 128b NEON+SME2 | clustered ~6 | 192/128/16 MB / 8 MB SLC | TSMC N3P | 4.26 GHz | ~11-12 est | 3895 | ~2.0 |
| Qualcomm Oryon Gen 3 Prime | ARMv9.2-A | 16B, 9-wide | 9-wide | ~650 ROB | ~400 / ~400 | 6 (incl 2 br) | 3+ | 4 (256b VLEN agg) | 128b NEON | unified+dist | 192/96/12 MB / 16 MB SLC | TSMC N3P | 4.74 GHz | ~10.5 | 3649 | ~2.1 |
| Arm Cortex-X925 "Blackhawk" | ARMv9.2-A | 32B, 10-wide | 10-wide | ~525 effective (768 inflight + 1536 fused) | ~448 total | 8 ALU 4×28-sched, 3 br | 4 AGU (2 store) | 6 FP/SIMD (3×~53) | 128b NEON+SVE2 | dist 4 INT + 3 FP | 64/64/2-3 MB | TSMC N3E | 3.62-3.8 GHz | ~11.8 | 3000-3400 | ~1.7 |
| Arm C1-Ultra "Travis" | ARMv9.3-A | 32B, 10-wide refined | 10-wide | ~2000 in-flight | ~512+ | 8 ALU | 4 AGU | 6 FP/SIMD+SME2 | 128b NEON+SVE2+SME2 | dist + SME sched | 64/64/3 MB / 16 MB SLC | TSMC N3P | 4.21 GHz | ~13.2 (Arm +25% vs X925) | 3502 | ~1.9 |
| AMD Zen 5 | x86-64 | 32B, dual 4-wide (=8) | 8-wide | 448 ROB | 240 INT / 384 FP (512b) | 6 ALU unified | 4 AGU | 4 FP (2× decoupled) + 1 store/FP-to-INT | 512b AVX-512 | unified INT, decoupled FP | 32/48/1 MB / 32 MB L3 | TSMC N4P | 5.7 GHz | ~12 | 3400 (9950X) | ~3.6 (incl L2) |
| Intel Lion Cove (Lunar Lake) | x86-64 | 32B, 8-wide (µop 12-wide) | 8-wide, HT removed | ~576 ROB | INT/FP/VEC files deeper | 6 ALU | 4 AGU (3 ld + 2 st overlap) | 4 vector (2×128 + 2×256) | 256b AVX2 (no AVX-512 on LL) | split 144 INT (6 ports) + 96 vector (5 ports) | 64/48/2.5-3 MB / 8 MB | TSMC N3B | 5.1 GHz | ~12.5 | 2900-3100 | ~4.5 |
| Tenstorrent Ascalon-D8 | RVA23 | 16B, 8-wide | 8-wide | ~450-500 ROB (est) | ~400 / ~400 | 6 INT (2 br) | 3 LSU | 2 FP + 2×256-bit RVV | 256b RVV 1.0 | ~6-queue dist | 64/64/1-2 MB | N3/N5 IP | ~3.2 GHz target | ~21 SPECint2006/GHz ≈ 9-10 IPC SPEC2017 | n/a | ~1.8 IP |
| SiFive P870 | RVA23 (V) | 16B, 6-wide | 6-wide | ~1120 inflight | ~256 / 128 V renames | 5 INT + 1 br-cap | 2-3 LSU | 2 FP/V w/ V sequencer | 128b RVV 1.0 | 4-queue dist | 64/64/1 MB | N7/N5 IP | ~3.0 GHz | ~13.5 SPECint2006/GHz ≈ 6.5 IPC SPEC2017 | n/a | ~1.2 IP |
| XiangShan Kunminghu V3 | RV64GC + V | 16B, 6-wide | 6-wide | ~256-320 ROB | ~192 / ~192 | 4 ALU + 2 br | 2-3 LSU | 2 FP + 2 V | 128b RVV 1.0 | dist | 64/64/1 MB | 7/12/28 nm | ~3.0 GHz sim | >15 pt/GHz target → 20 (~7-9 IPC SPEC2017) | Neoverse-N2 -8% claim | ~1.5 (12 nm) |
| Ventana Veyron V2 | RV64GC+V | 16B, 8-wide (fusion magnifies) | up to 15 internal ops/clock with fusion | ~480+ ROB | ~384 / 256 | 6+ INT | 4 LSU | 2-4 FP/V | 256-512b RVV config | dist | 64/64/2 MB | TSMC N4/N3 | ~3.6 GHz | ~17 SPECint2006/GHz ≈ 8 IPC SPEC2017 | n/a (datacenter) | ~2.2 |
| MIPS P8700 | RV64GC | 8B fetch, 4-wide | 4-wide | ~96 ROB | smaller | 4 ALU | 2 LSU | 2 FP | 128b | dist | 64/64/256 KB-2 MB | 7/16 nm | ~2.5 GHz | ~6 IPC SPEC2017 est | n/a | ~0.9 |
| BOOM v3 / SonicBOOM | RV64GC | 4-8 wide param | 2-4-wide typical | 32-256 ROB param | 64-128 / 64-128 | 1-4 ALU | 1-2 LSU | 1-2 FP | optional | dist | param | acad/FPGA | ~1.5-2.5 GHz ASIC est | 6.2 CoreMark/MHz → 3-5 IPC SPEC2017 | n/a | param |
| AMD Strix Point Zen 5c | x86-64 | 32B, 8-wide | 8-wide | 448 ROB | 240/384 | 6 | 4 | 4 FP | AVX-512 on 256b datapath | unified | 32/48/1 MB / 24 MB | TSMC N4P | 5.1 GHz | ~11.5 | 2900 | ~2.8 |
Reference rows: D9500 (C1-Ultra 4.21 / Premium 3.5 / Pro 2.7 GHz), S8 Elite Gen 5 (Oryon Gen 3 Prime 4.6-4.74 / Perf 3.62 GHz), Tensor G5 (X4 3.78 / A725 3.05 / A520 2.25 GHz), Exynos 2600 (Samsung 2 nm GAA, GB6 ST 3197). Source: docs/spec-db/mobile-sota-2026.yaml.
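
As a rough cross-check on the table above, the Geekbench 6 single-thread scores can be normalized by peak clock to compare per-clock throughput. The sketch below uses only rows that list a single GB6 ST value and a single F_max. The function and variable names are illustrative and are not part of the repo, and dividing a score by peak clock ignores memory and boost-residency effects.

```python
# GB6 ST score and F_max (GHz), copied from the comparison table above.
# Rows with a score range or a clock range are deliberately left out.
CORES = {
    "Apple A19 Pro (P)": (3895, 4.26),
    "Qualcomm Oryon Gen 3 Prime": (3649, 4.74),
    "Arm C1-Ultra": (3502, 4.21),
    "AMD Zen 5 (9950X)": (3400, 5.7),
}


def gb6_per_ghz(score, f_max_ghz):
    """GB6 ST points per GHz of peak clock (a crude per-clock proxy)."""
    return score / f_max_ghz


def ranked_by_per_clock(cores):
    """Return (name, points_per_ghz) tuples, highest per-clock first."""
    rows = [(name, gb6_per_ghz(s, f)) for name, (s, f) in cores.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    for name, pts in ranked_by_per_clock(CORES):
        print(f"{name:28s} {pts:6.0f} GB6 ST points/GHz")
```

On these four rows the ordering by points per GHz differs from the ordering by raw score: the Oryon Gen 3 Prime's higher clock accounts for more of its score than C1-Ultra's does.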
- packages/chiprtl/cpu/e1_cpu_subsystem_stub.sv: tiny in-order RV64 fetch/execute with a 32-bit AXI-Lite manager and 32 architectural registers held as 64-bit. Supports JAL/JALR/BEQ/BNE/LUI/AUIPC/ADDI/ADD/SUB/LW/SW and halts on ECALL/EBREAK/illegal/AXI error. Not Linux-capable. No CSR, privilege, MMU, traps, atomics, compressed, float, vector.
- rtl/cpu/e1_cva6_wrapper.sv: drop-in wrapper for OpenHW CVA6 (ArianeDefaultConfig, RV64IMAFDC + S-mode + Sv39), guarded by +define+E1_HAVE_CVA6. CVA6 = 6-stage single-issue in-order with limited speculation. Closest commercial peer: Cortex-A55-class. Expected SPEC2017 int IPC: ~1.5-1.8 on RTL, ~10× behind Cortex-X925.
- eliza-rocket-manifest.json, commit 69eba860): 5-stage in-order, single-issue, RV64GC. SPEC2017 int IPC ~1.0-1.5.
- benchmarks/results/simulator-arch-metrics-sota.json): 2-core, 3.8 GHz, modeled IPC 2.42, 2.76 W package. Architecture target only: cpu_ap_evidence_blocked.
- make cpu-ap-completion-gate.

Gap to A19 Pro / Oryon Gen 3 / C1-Ultra:

| Parameter | Target | Rationale |
|---|---|---|
| ISA | RVA23 + V (RVV 1.0 VLEN=256) + Zfh/Zvfh + Zvbb/Zvkt + Zicboz/Zicbom + Ztso + Sv57 + Smaia (AIA) + Zihintpause + Zicond + Zama (matrix) | RVA23 mandated for Android RISC-V ABI and by Ubuntu 25.10; Ztso to run translated x86/ARM with single fence cost; Sv57 future-proofs virtual address space beyond Sv48's 256 TiB |
| Front-end fetch | 32 B / cycle | RVC means up to 16 inst in 32 B; supports 10-wide decode |
| Decode | 8-wide native + 2 fused = 10 effective | Match X925/C1-Ultra; macro-op fusion recovers density loss vs ARM |
| L0 µop cache | 3 K entries, 12-wide read | Apple/Lion Cove style; bypasses decode for hot loops |
| Branch predictor | 16K-entry L1 BTB, 64K L2 BTB, TAGE-SC-L, 32-entry RAS, ITTAGE | Matches X925's 16K/2048+L2. Mispredict <14 cyc |
| Dispatch / Rename / Retire | 8 / 8 | PRF-based renaming (not ROB-based) for energy |
| ROB | 512 entries (~700 effective with fusion expansion) | Physical size close to X925 effective (~525); effective capacity with fusion reaches Apple (~700) |
| PRF | 400 INT (64-bit) / 400 FP+V (256-bit) | Apple-class. Vector reg width 256b matches RVV DLEN=256 |
| Schedulers | Distributed: 4×32 INT, 2×48 FP/V, 2×40 LSU | Match X925 four-cluster; energy << unified at this width |
| Execution ports | 6 ALU (2 br), 2 IMUL/IDIV shared on ALU0/3, 4 FP/V, 4 AGU (2 ld + 2 st + 2 dual) | Follows X925 (8 ALU, 4 AGU) but trims to 6 ALU while keeping 4 AGU |
| Load / Store queue | 192 LQ / 128 SQ | Apple-class; Oryon ~150/100. Store-set predictor for memory-disamb |
| Store-to-load forwarding | 4 simultaneous, partial-overlap | X925 explicitly improved; Zen 5 perfect-store forwarding |
| Vector unit | 2× 256-bit RVV 1.0 (DLEN=256, ELEN=64), Zvbb/Zvfh/Zvkt; future SME-like matrix tile via Zama | Matches Ascalon (2× 256b). X925 is 4× 128b NEON; equal effective BW |
| Matrix | Reserve area for RISC-V matrix when ratified; meanwhile expose via Zvqdotq/Zvfh and CSR-mapped tile regs | A19 Pro / C1-Ultra bet on SME2 INT8/BF16 tile units |
| Memory ordering | RVWMO native + Ztso mode selectable per-page (PTE bit) or per-thread | Ztso lets x86 binaries run without fence-spam, ~5-15% perf for translated |
| MMU | Sv48 default, Sv57 enabled; ASID 16-bit; 64 L1 ITLB + 64 L1 DTLB + 2048 L2 unified | Matches X925 (96/2048). Required for >32 GB phone RAM |
| Caches | 64 KB L1I (4-way) + 64 KB L1D (4-way, 4-cycle), private 1 MB L2 (12-cycle), shared 8 MB L3, 16 MB SLC | Matches X925 / D9500 |
| Clock | 4.0-4.3 GHz typical, 4.5 GHz burst | Below A19 Pro 4.26 to guard RVWMO+Ztso fence costs |
| IPC SPEC2017 target | ≥9 (≥ 22 SPECint2006/GHz, beating Ascalon's 21) | Achievable per Veyron V2 |
| Area (3 nm class) | ~1.8-2.0 mm² incl L2 | Tracks X925 (~1.7) |
| Power | 2.8 W burst, 1.4 W sustained | Matches soc-optimized-operating-point.yaml |
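
Several rationale cells in the table above rest on simple arithmetic (fetch capacity under RVC, vector datapath bandwidth, effective decode width). The sketch below reproduces those figures from the table's own numbers. The helper names are made up for illustration, and the X925 vector configuration is taken as stated in the Vector unit row.

```python
def max_insts_per_fetch(fetch_bytes, min_inst_bytes):
    """Upper bound on instructions delivered by one fetch block."""
    return fetch_bytes // min_inst_bytes


def vector_bits_per_cycle(pipes, width_bits):
    """Peak vector datapath width summed over all pipes."""
    return pipes * width_bits


# Front-end fetch row: 32 B per cycle.
RVC_BOUND = max_insts_per_fetch(32, 2)    # all 16-bit compressed instructions
RV32B_BOUND = max_insts_per_fetch(32, 4)  # all 32-bit instructions

# Decode row: 8 native slots plus 2 fused pairs counted as extra instructions.
EFFECTIVE_DECODE = 8 + 2

# Vector unit row: 2 x 256-bit target versus 4 x 128-bit NEON on X925.
TARGET_VEC_BITS = vector_bits_per_cycle(2, 256)
X925_VEC_BITS = vector_bits_per_cycle(4, 128)

# ROB row: ratio of effective (~700) to physical (512) entries.
ROB_FUSION_EXPANSION = 700 / 512

if __name__ == "__main__":
    print("RVC fetch bound:", RVC_BOUND, "inst/cycle")
    print("32-bit-only fetch bound:", RV32B_BOUND, "inst/cycle")
    print("effective decode width:", EFFECTIVE_DECODE)
    print("vector bits/cycle (target, X925):", TARGET_VEC_BITS, X925_VEC_BITS)
    print(f"ROB fusion expansion: {ROB_FUSION_EXPANSION:.2f}x")
```

One consequence worth noting: with only 32-bit instructions, a 32 B fetch block holds at most 8 instructions, so the 10-effective decode width is reachable only when compressed instructions or fused pairs are present in the stream.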
Required plan extending cpu-npu-2028-readiness-scorecard.yaml lines 115-125:
| Benchmark | Metric | 2028 target (big core) | 2026 reference |
|---|---|---|---|
| SPEC CPU2017 int rate | per-core | ≥9 | X925 ~11.8, A19 ~12 |
| SPEC CPU2017 int speed | speed | ≥7 | X925 ~8, Zen 5 ~9 |
| SPEC CPU2017 fp rate | rate | ≥7 | X925 ~10, Zen 5 ~14 |
| Geekbench 6 ST | score | ≥2800 | A19 3895, Oryon Gen 3 3649, C1-Ultra 3502 |
| Geekbench 6 MT | score | ≥8500 (8-core) | A19 9988, S8EG5 11068, D9500 10417 |
| CoreMark/MHz | rate | ≥10 | BOOM 6.2, X925 ~13, Zen 5 ~12.5 |
| CoreMark-Pro | composite | ≥25k | enterprise/mobile parity gate |
| Embench-IoT | composite | full pass | small-program code-density gate |
| JetStream 2 (V8) | composite | ≥250 | Android/Chrome reality check |
| Octane 2.0 | score | ≥70k | legacy JS reality check |
| SPECjbb 2015 max-jOPS | jOPS | ≥40k | server-class Java; CHI/coherency stress |
| STREAM Triad | GB/s | ≥150 | per mobile-sota-2026.yaml LPDDR5X |
| lmbench lat_mem_rd | ns at 1 GB stride | ≤120 ns | TLB+DRAM latency floor |
| lmbench bw_mem | GB/s | ≥120 | sustained memcpy |
| fio randread 4k QD32 | IOPS | ≥800k | UFS 4.1 path |
| systrace / perfetto cold-start | ms | ≤900 ms Chrome cold | D9500-class flagship |
| MLPerf Mobile (CPU fallback) | mobilenet-v3 / bert-mobile | within 2× NPU latency | "NPU real, CPU fallback acceptable" |
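
The numeric targets in the table above lend themselves to an automated gate. The following is a minimal sketch, assuming measured results arrive as a flat dictionary. The metric key names and the `evaluate` / `gate_passes` helpers are hypothetical and do not correspond to any existing script in the repo. Only a subset of rows is encoded, and a missing measurement is treated as not passing.

```python
import operator

# metric key -> (comparison, threshold), thresholds copied from the table.
# Key names are invented for this sketch.
TARGETS = {
    "spec2017_int_rate": (operator.ge, 9),
    "geekbench6_st": (operator.ge, 2800),
    "geekbench6_mt": (operator.ge, 8500),
    "coremark_per_mhz": (operator.ge, 10),
    "stream_triad_gbps": (operator.ge, 150),
    "lat_mem_rd_ns": (operator.le, 120),
    "fio_randread_4k_qd32_iops": (operator.ge, 800_000),
    "chrome_cold_start_ms": (operator.le, 900),
}


def evaluate(measured):
    """Map each target metric to 'pass', 'fail', or 'missing'."""
    results = {}
    for name, (compare, threshold) in TARGETS.items():
        if name not in measured:
            results[name] = "missing"
        elif compare(measured[name], threshold):
            results[name] = "pass"
        else:
            results[name] = "fail"
    return results


def gate_passes(measured):
    """The gate passes only if every encoded metric is present and passes."""
    return all(status == "pass" for status in evaluate(measured).values())


if __name__ == "__main__":
    # Placeholder numbers purely to exercise the code, not real results.
    sample = {"geekbench6_st": 2900, "lat_mem_rd_ns": 130}
    for metric, status in evaluate(sample).items():
        print(f"{metric:28s} {status}")
    print("gate:", "PASS" if gate_passes(sample) else "BLOCKED")
```

Treating absent data as a blocker rather than a pass mirrors the evidence-blocked wording used elsewhere in this report. A real gate would also need the non-numeric rows (Embench-IoT full pass, MLPerf within 2× NPU latency).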
Required infrastructure:
- llvm-test-suite.
- aosp_simulator_evidence_blocked).
- powertop --html, iio:device* rails on board.
- cpu-npu-2028-burst-thermal-transient.json.
- npu_arch_sim_open_2028 / cpu_arch_sim_sota_2028.
- lui+addi, lui+ld, slli+add, auipc+jalr, addi+bne. RISC-V fusion ~5.4% effective inst reduction.
- docs/spec-db/npu-2028-target.yaml requires cache_coherent_cpu_submission. CHI-coherent NPUs rare; Tensor G5 TPU non-coherent.
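
The fusion pairs listed above (lui+addi, lui+ld, slli+add, auipc+jalr, addi+bne) can be illustrated with a toy detector. This is a simplified sketch, not the decoder's actual rule set: the `Inst` type, the register-dependence conditions, and the greedy pairing are all assumptions chosen for illustration. Real fusion logic also checks immediates, alignment, and fetch-block boundaries.

```python
from typing import NamedTuple, Optional


class Inst(NamedTuple):
    op: str
    rd: Optional[str]            # destination register, None if there is none
    rs1: Optional[str]           # first source register, None if unused
    rs2: Optional[str] = None    # second source register, None if unused


# Adjacent (first, second) opcode pairs named in the text.
FUSIBLE = {
    ("lui", "addi"), ("lui", "ld"), ("slli", "add"),
    ("auipc", "jalr"), ("addi", "bne"),
}


def can_fuse(a, b):
    """Conservative check: b must consume a's result, and if b writes a
    register it must overwrite a's destination (intermediate value is dead)."""
    if (a.op, b.op) not in FUSIBLE:
        return False
    if a.rd is None or a.rd == "x0":
        return False
    if a.rd not in (b.rs1, b.rs2):
        return False
    b_rd = None if b.rd in (None, "x0") else b.rd
    return b_rd is None or b_rd == a.rd


def fused_op_count(insts):
    """Greedy left-to-right pairing; returns the op count after fusion."""
    i = ops = 0
    while i < len(insts):
        if i + 1 < len(insts) and can_fuse(insts[i], insts[i + 1]):
            i += 2
        else:
            i += 1
        ops += 1
    return ops


# Hand-written instruction sequence, deliberately dense in fusible pairs.
PROGRAM = [
    Inst("lui", "a0", None),
    Inst("addi", "a0", "a0"),
    Inst("slli", "t0", "a1"),
    Inst("add", "t0", "t0", "a2"),
    Inst("ld", "a3", "t0"),
    Inst("addi", "a4", "a4"),
    Inst("bne", None, "a4", "x0"),
]

if __name__ == "__main__":
    print(len(PROGRAM), "instructions ->", fused_op_count(PROGRAM), "ops")
```

The sample sequence is far denser in fusible pairs than real code, so its reduction says nothing about the ~5.4% figure quoted above, which would have to be measured on actual workloads.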