packages/chip/docs/architecture-optimization/sota-2028/branch-predictors.md
Sub-report of 2028-sota-integrated-report.md.
Every credible high-IPC core in 2025-2026 — Apple, Qualcomm, ARM, Intel, AMD, BOOM, XiangShan — uses a TAGE-derived multi-component conditional predictor. Differences are in storage budget, history lengths, statistical-corrector add-ons, and how the front-end is decoupled from fetch.
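
To make the "multi-component" idea concrete, here is a minimal TAGE-style sketch. It is a toy under stated assumptions, not the BOOM or XiangShan implementation: `MiniTage`, `Entry`, `count_mispredicts`, the table sizes, and the history lengths are all illustrative, and the usefulness update is simplified (no alternate prediction, no periodic reset).

```python
from dataclasses import dataclass

@dataclass
class Entry:
    tag: int = -1     # -1 marks an unallocated entry (never matches a real tag)
    ctr: int = 0      # signed 3-bit counter in [-4, 3]; >= 0 predicts taken
    useful: int = 0   # 2-bit usefulness counter guarding replacement

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

class MiniTage:
    """Toy TAGE: a bimodal base table plus tagged tables indexed with
    increasingly long slices of global history. Sizes are illustrative."""

    def __init__(self, hist_lens=(4, 8, 16, 32), log_size=10, tag_bits=8):
        self.hist_lens = hist_lens
        self.log_size = log_size
        self.tag_bits = tag_bits
        self.base = [0] * (1 << log_size)   # 2-bit counters in [-2, 1]
        self.tables = [[Entry() for _ in range(1 << log_size)] for _ in hist_lens]
        self.ghr = 0                        # global history, newest outcome in bit 0

    def _fold(self, length, bits):
        """XOR-fold the newest `length` history bits down to `bits` bits."""
        h = self.ghr & ((1 << length) - 1)
        out = 0
        while h:
            out ^= h & ((1 << bits) - 1)
            h >>= bits
        return out

    def _index(self, pc, t):
        mask = (1 << self.log_size) - 1
        return (pc ^ (pc >> self.log_size) ^ self._fold(self.hist_lens[t], self.log_size)) & mask

    def _tag(self, pc, t):
        h = self.hist_lens[t]
        mask = (1 << self.tag_bits) - 1
        return (pc ^ self._fold(h, self.tag_bits) ^ (self._fold(h, self.tag_bits - 1) << 1)) & mask

    def _provider(self, pc):
        """Return (table number, entry) of the longest-history tag match, or None."""
        for t in reversed(range(len(self.tables))):
            e = self.tables[t][self._index(pc, t)]
            if e.tag == self._tag(pc, t):
                return t, e
        return None

    def predict(self, pc):
        hit = self._provider(pc)
        if hit is None:
            return self.base[pc & ((1 << self.log_size) - 1)] >= 0
        return hit[1].ctr >= 0

    def update(self, pc, taken):
        """Train on the resolved outcome; return True if the prediction was right."""
        hit = self._provider(pc)
        pred = self.predict(pc)
        step = 1 if taken else -1
        if hit is None:
            i = pc & ((1 << self.log_size) - 1)
            self.base[i] = clamp(self.base[i] + step, -2, 1)
        else:
            e = hit[1]
            e.ctr = clamp(e.ctr + step, -4, 3)
            # Simplification: real TAGE only adjusts `useful` when the provider
            # disagrees with the alternate prediction.
            e.useful = clamp(e.useful + (1 if pred == taken else -1), 0, 3)
        if pred != taken:
            # On a mispredict, allocate one entry in a longer-history table.
            start = 0 if hit is None else hit[0] + 1
            for t in range(start, len(self.tables)):
                e = self.tables[t][self._index(pc, t)]
                if e.useful == 0:
                    e.tag, e.ctr, e.useful = self._tag(pc, t), (0 if taken else -1), 0
                    break
            else:
                # No free slot: age the candidates so a later allocation can succeed.
                for t in range(start, len(self.tables)):
                    self.tables[t][self._index(pc, t)].useful -= 1
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << max(self.hist_lens)) - 1)
        return pred == taken

def count_mispredicts(predictor, pc, outcomes):
    return sum(0 if predictor.update(pc, o) else 1 for o in outcomes)
```

Feeding it a repeating taken-taken-taken-taken-not-taken loop pattern at a single PC shows the point of the tagged tables: a bimodal counter alone would miss the loop exit every iteration, while the shortest tagged table learns it after the first miss.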
CVA6 (Ariane) default cv64a6_imafdc_sv39_hpdcache — BTBEntries=32, BHTEntries=128, RASDepth=2. Single-issue in-order, 6-stage. Categorically inadequate for a 2028 flagship-mobile AP.
BOOM uses TAGE-L (TAGE + loop) behind a small NLP (micro-BTB + BIM + RAS). MegaBoom budgets TAGE around 4-8 KB with ~6 tagged tables.
XiangShan Kunminghu v2 — current SOTA open-source high-performance RISC-V BPU:
| Component | Configuration |
|---|---|
| uFTB (micro-FTB, 1-cycle) | 256 entries, 4-bit GHR |
| FTB (Fetch Target Buffer, replaces BTB) | 2,048 entries, 4-way, 20-bit tag |
| TAGE conditional | 4 tables × 4,096 × 8-bit tags; histories {8, 13, 32, 119} — 16K total |
| ITTAGE indirect | 5 tables × {256,256,512,512,512} × 9-bit tags; histories {4,8,13,16,32} — ~2K |
| RAS | 16 architectural / 32 speculative, 3-bit counter |
| SC | 4 tables × 512 rows × 6-bit, histories {0,4,10,16} |
| FTQ | 64 entries |
| IBuffer | 48 entries |
| Decode width | 6 |
XiangShan reports SPECCPU2006 above 15 points/GHz on Kunminghu, with v3 targeting 20/GHz.
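
The "16K total" and "~2K" figures in the table are entry counts, and they can be rechecked with a few lines of arithmetic. The sketch below uses only the table counts and tag/row widths listed above; per-entry counters, usefulness bits, and ITTAGE targets are not given in the table, so they are deliberately left out. The variable and function names are illustrative.

```python
# Entry counts and widths copied from the Kunminghu v2 table above.
TAGE_TABLES = [4096] * 4
TAGE_TAG_BITS = 8
ITTAGE_TABLES = [256, 256, 512, 512, 512]
ITTAGE_TAG_BITS = 9
SC_TABLES = [512] * 4
SC_ROW_BITS = 6

def total_entries(tables):
    return sum(tables)

def bits_to_bytes(bits):
    return bits // 8

tage_entries = total_entries(TAGE_TABLES)        # 16,384 entries ("16K")
ittage_entries = total_entries(ITTAGE_TABLES)    # 2,048 entries ("~2K")

# Tag storage only; counters and targets would add to these figures.
tage_tag_bytes = bits_to_bytes(tage_entries * TAGE_TAG_BITS)
ittage_tag_bytes = bits_to_bytes(ittage_entries * ITTAGE_TAG_BITS)
sc_bytes = bits_to_bytes(total_entries(SC_TABLES) * SC_ROW_BITS)

print(tage_entries, ittage_entries, tage_tag_bytes, ittage_tag_bytes, sc_bytes)
```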
192 KB total storage budget, with 64 KB TAGE-SC-L baseline:
Every flagship core uses a decoupled BPU running ahead of fetch via an FTQ. Modern revisits show FDIP recovers most front-end stalls when BPU accuracy holds, but degrades badly on mobile/server workloads with large I-footprints, which is exactly the Android case. (UDP; PDIP, ASPLOS '24; DEER, arXiv 2504.20387)
A dynamic predictor reduces cycles ~10%, adds ~7% core power on average. For a 1-2 W mobile big-core: BPU power 3-6%, BPU+L1I area 5-10%.
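
As a rough worked example of what the cycle and power figures imply for energy: assuming a fixed clock frequency (so run time scales with cycle count) and that the two percentages compose multiplicatively, energy per task is power times time. Both assumptions are simplifications for illustration, and `relative_energy` is a hypothetical helper name.

```python
def relative_energy(cycle_reduction, power_increase):
    """Energy relative to a baseline without the predictor.

    Assumes fixed frequency, so time is proportional to cycles,
    and energy = power * time.
    """
    return (1.0 - cycle_reduction) * (1.0 + power_increase)

# ~10% fewer cycles at ~7% more core power.
ratio = relative_energy(0.10, 0.07)
print(f"relative energy per task: {ratio:.3f}")
```

Under those assumptions the predictor is a small net energy win per task (0.90 × 1.07 = 0.963) on top of the performance gain.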
AsmDB (ISCA '19) shows datacenter and large mobile workloads spend a substantial fraction of cycles in front-end stalls dominated by I-cache and BTB misses, not data misses. The repo's SOTA-2-core model encodes per-workload MPKI of 1.116 (CoreMark-like), 3.472 (Linux kernel mix), 4.464 (Android UI), 1.922 (TFLite CPU fallback). These are reasonable planning numbers for a 2028 flagship.
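
To show what these planning numbers mean in cycles, the sketch below converts MPKI into CPI lost to mispredictions. It assumes the MPKI values count branch mispredictions and uses a 13-cycle flush penalty purely as an illustrative assumption; `mispredict_cpi` is a hypothetical helper.

```python
MISPREDICT_PENALTY_CYCLES = 13  # assumption for illustration, not a measured value

# Per-workload planning MPKI quoted in the paragraph above.
PLANNING_MPKI = {
    "CoreMark-like": 1.116,
    "Linux kernel mix": 3.472,
    "Android UI": 4.464,
    "TFLite CPU fallback": 1.922,
}

def mispredict_cpi(mpki, penalty_cycles=MISPREDICT_PENALTY_CYCLES):
    """Cycles per instruction lost to mispredicts: (misses / 1000 instr) * penalty."""
    return mpki / 1000.0 * penalty_cycles

for name, mpki in PLANNING_MPKI.items():
    print(f"{name}: {mispredict_cpi(mpki):.3f} CPI lost to mispredicts")
```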
- packages/chiprtl/cpu/e1_cpu_subsystem_stub.sv: no branch predictor of any kind. Every branch resolved by sequential FSM.
- rtl/cpu/e1_cva6_wrapper.sv, gated by E1_HAVE_CVA6. Brings the toy default predictor: 32-entry BTB, 128-entry 2-bit bimodal BHT, 2-entry RAS.
- ElizaRocketConfig. Rocket's BPU is similarly minimal.
- benchmarks/results/simulator-arch-metrics-sota.json is a static input to the perf/W model, not driven by any predictor structure.

Honest claim today: the project has zero branch prediction implementation, evidence, or planning artifact specific to that subsystem.
Adopt XiangShan Kunminghu BPU shape, scaled to a 2028 flagship envelope. Kunminghu/KMH-v3 is the only open-source RISC-V design in 2025-2026 with a credible, taped-out, publicly-documented TAGE-SC + ITTAGE + uFTB/FTB + RAS + SC + loop stack. BOOM TAGE-L is a viable fallback; from-scratch BPU is multi-year risk.
Hard targets for the big P-core:
| Component | Target | Rationale |
|---|---|---|
| Family | TAGE-SC-L + ITTAGE + RAS, FTB-based, decoupled BPU with FTQ | Matches XiangShan KMH and Apple/Oryon storage class |
| uFTB | 512 entries, 4-way, ~16 KB reach for zero-bubble | Above KMH 256; below X925 2K |
| FTB | 8,192 entries, 8-way, 24-bit tag, 2 taken/cycle | Match X925 hierarchy |
| L2 BTB | 16,384, 2-3 cyc | Match X925/Zen 5 reach |
| TAGE-SC conditional | 5 tables × 4-8 K, ~64-96 KB, histories geometric to ~200 bits | KMH currently 16K/4 tables; CBP-5 floor 64 KB |
| Statistical corrector | 4-8 tables × 1K rows, signed counters | Standard tail |
| Loop predictor | 64 entries | Cheap, high payoff on SPEC libquantum/leela |
| ITTAGE | 5-6 tables × ~512 entries, history to 64 bits, ~2-3K total | Above KMH ~2K; below Oryon 2,048-indirect |
| RAS | 32 architectural / 64 speculative, with overflow handling | KMH 16/32; Oryon 48; X925 29; Lion Cove 24 |
| FTQ | 64-96 entries | KMH 64; needed for FDIP |
| Misprediction penalty | ≤ 12-14 cycles | Oryon 13, Zen 4/5 13 |
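
The "histories geometric to ~200 bits" target for the TAGE tables can be made concrete with a short sketch. It assumes 5 tables, a maximum history of 200 bits, and a minimum history of 8 bits; the minimum is an illustrative choice, not a value fixed by this report, and `geometric_histories` is a hypothetical helper.

```python
def geometric_histories(num_tables, min_len, max_len):
    """History length for table i: min_len * (max_len / min_len) ** (i / (n - 1))."""
    if num_tables == 1:
        return [min_len]
    ratio = max_len / min_len
    return [
        round(min_len * ratio ** (i / (num_tables - 1)))
        for i in range(num_tables)
    ]

# 5 tagged tables spanning 8 to 200 bits of global history.
print(geometric_histories(5, 8, 200))  # [8, 18, 40, 89, 200]
```

Each table sees roughly 2.2× the history of the previous one, which is what lets a handful of tables cover both short and very long correlations.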
Implement FDIP from day one. BPU must run ahead of fetch via FTQ and emit prefetch requests into L1I. Add a shadow structure (per-FTB next-line hint or AsmDB-style boot-up I-prefetch) to defend against Android cold-launch I-footprint blowups.
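
A toy cycle model illustrates why the run-ahead matters. This is a sketch under strong assumptions, not a model of any real front-end: prediction is perfect, the BPU produces one fetch block per cycle, prefetch bandwidth is unlimited, lines are never evicted from L1I, and every new line costs a fixed `miss_latency`. The function name and parameters are hypothetical.

```python
from collections import deque

LINE_BYTES = 64

def fdip_stall_cycles(block_addrs, ftq_depth, miss_latency=10):
    """Count cycles in which fetch cannot consume the FTQ head."""
    ftq = deque()
    ready_at = {}        # cache line -> cycle at which it is present in L1I
    next_block = 0
    fetched = 0
    cycle = 0
    stalls = 0
    while fetched < len(block_addrs):
        # BPU: enqueue one predicted block per cycle while the FTQ has room,
        # and issue an L1I prefetch for its line if none is in flight.
        if next_block < len(block_addrs) and len(ftq) < ftq_depth:
            addr = block_addrs[next_block]
            ready_at.setdefault(addr // LINE_BYTES, cycle + miss_latency)
            ftq.append(addr)
            next_block += 1
        # Fetch: consume the head only once its line has arrived.
        if ftq and ready_at[ftq[0] // LINE_BYTES] <= cycle:
            ftq.popleft()
            fetched += 1
        else:
            stalls += 1
        cycle += 1
    return stalls

# 100 fetch blocks, each in a distinct cold cache line.
blocks = [i * LINE_BYTES for i in range(100)]
print(fdip_stall_cycles(blocks, ftq_depth=1))   # no run-ahead
print(fdip_stall_cycles(blocks, ftq_depth=64))  # BPU runs ahead of fetch
```

With a one-entry FTQ every cold line exposes its full latency to fetch; with a deep FTQ only the first miss is exposed and the rest overlap. The model says nothing about the mispredict and large-footprint cases where FDIP degrades.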
For RISC-V mobile flagship: skip a full µop cache (Lion Cove 5,250 µop / 12-wide is x86-baroque; RISC-V's fixed encoding makes decode cheap). Use a small decoded-instruction buffer (KMH IBuffer = 48 entries). Apple-style move is larger L1I (64-128 KB) with FDIP, not a µop cache. Oryon's 192 KB L1I is the existence proof.
| Workload | MPKI target | Plausibility |
|---|---|---|
| SPECint 2017 average | ≤ 4.0 | TAGE-SC-L 64 KB CBP train 3.986; Bullseye 3.4 |
| 505.mcf | ≤ 11 | Branch-misp dominated; X925/Zen 5 zone |
| 541.leela | ≤ 5 | Loop+H2P heavy |
| Geekbench 6 Navigation | ≤ 6 | Skymont reported 4.33 |
| Android UI (ART/JIT) | ≤ 5 | Tracks repo planning 4.464 |
| Android cold-launch | ≤ 8 | Stall-dominated; FDIP + L1I more than predictor accuracy |
| Linux kernel mix | ≤ 4 | Repo planning 3.472 |
Primary: track XiangShan Kunminghu v2 → v3 upstream and patch into the e1 SoC. Add eliza-kunminghu-manifest.json discipline + make xiangshan-generator-check. Keep CVA6 wrapper for first Linux/Android bring-up smoke. Do not plan a from-scratch BPU.
Three layers under packages/chip/benchmarks/:
-d in_asm,nochain or Spike branch-trace plugin to dump (PC, target, taken, kind).

| Benchmark | Why |
|---|---|
| SPEC CPU2017 intrate (gcc, mcf, leela, omnetpp, xalancbmk, perlbench, deepsjeng, exchange2, x264, xz) | Industry standard; license-gated |
| Embench-IoT 1.0 | Permissive, RISC-V-native |
| CoreMark-Pro | Permissive successor; modest branch pressure |
| Geekbench 6 + open equivalents | Competitor comparison axis |
| AsmDB Android traces + AOSP simpleperf | Real Android front-end pressure |
| JetStream2 / Octane v2 in V8/Hermes | Indirect-branch heavy; stresses ITTAGE |
| App-startup traces (Chrome, YouTube cold) | Closest to real flagship benchmark axis |
Mandatory; compress traces to ~10-100M-instruction representative checkpoints. Full SPEC2017 on cycle-accurate XS-gem5 / verilator + BPU is infeasible at iteration speed.
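
Once a trace of (PC, target, taken, kind) records exists, conditional-branch MPKI for a candidate predictor is a short computation. The sketch below assumes a hypothetical comma-separated text format with hex addresses and a `cond` kind label; the real dump format of any given plugin may differ, and `BranchRecord`, `parse_line`, and `conditional_mpki` are illustrative names.

```python
from typing import Callable, Iterable, NamedTuple

class BranchRecord(NamedTuple):
    pc: int
    target: int
    taken: bool
    kind: str    # e.g. "cond", "call", "ret", "indirect" (hypothetical labels)

def parse_line(line: str) -> BranchRecord:
    """Parse one 'pc,target,taken,kind' line, with pc/target in hex."""
    pc, target, taken, kind = line.strip().split(",")
    return BranchRecord(int(pc, 16), int(target, 16), taken == "1", kind)

def conditional_mpki(records: Iterable[BranchRecord],
                     predict: Callable[[int], bool],
                     instructions: int) -> float:
    """Mispredicted conditional branches per 1000 retired instructions."""
    misses = sum(
        1 for r in records
        if r.kind == "cond" and predict(r.pc) != r.taken
    )
    return 1000.0 * misses / instructions

trace = [
    "0x1000,0x1040,1,cond",
    "0x1004,0x2000,1,call",
    "0x1000,0x1040,0,cond",
]
records = [parse_line(line) for line in trace]
always_taken = lambda pc: True
print(conditional_mpki(records, always_taken, instructions=2000))
```

The `instructions` count has to come from the same checkpoint as the trace, since the records only cover branches.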
Measurable on competitor silicon:
- BR_MIS_PRED_RETIRED, BR_RETIRED, BR_IND_MIS_PRED_RETIRED) via simpleperf.
- frame_metrics.

Not measurable (closed):
Gauntlet (gate before any 2028 BPU claim):
- simulator-arch-metrics{,-sota}.json.
- docs/arch/branch-prediction.md contract mirroring cpu-subsystem.md style.
- eliza-kunminghu-manifest.json pinning open BPU IP commit hash.
- make branch-prediction-check gate emitting BTB/FTB/TAGE/ITTAGE/RAS sizes to docs/evidence/cpu_ap/branch-prediction-params.json.
- benchmarks/cpu/branch/.
- Zihpm).
- compute-silicon.md for the operating-point optimizer.
- pd/openlane floorplan from day one.
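
A parameter gate of the kind listed above could look roughly like the following. This is a sketch only: the key names, the evidence layout, and the `check_params` and `render_evidence` helpers are hypothetical, and the minimums are copied from the hard-target table earlier in this report. The sample input reuses the Kunminghu v2 sizes from the configuration table.

```python
import json

# Minimums from the hard-target table; key names are hypothetical.
TARGET_MINIMUMS = {
    "uftb_entries": 512,
    "ftb_entries": 8192,
    "l2_btb_entries": 16384,
    "loop_entries": 64,
    "ras_arch_entries": 32,
    "ras_spec_entries": 64,
    "ftq_entries": 64,
}

def check_params(params):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for key, minimum in TARGET_MINIMUMS.items():
        value = params.get(key)
        if value is None:
            failures.append(f"{key}: missing (target >= {minimum})")
        elif value < minimum:
            failures.append(f"{key}: {value} < {minimum}")
    return failures

def render_evidence(params):
    """Serialize sizes plus gate result as JSON text (layout is illustrative)."""
    return json.dumps(
        {"params": params, "failures": check_params(params)},
        indent=2,
        sort_keys=True,
    )

# Kunminghu v2 sizes from the configuration table earlier in this report.
KMH_V2 = {
    "uftb_entries": 256,
    "ftb_entries": 2048,
    "ras_arch_entries": 16,
    "ras_spec_entries": 32,
    "ftq_entries": 64,
}

print(render_evidence(KMH_V2))
```

A real gate would read the sizes from the generated configuration rather than a hand-written dictionary, so that the evidence file cannot drift from the RTL.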