Branch prediction contract

packages/chip/docs/arch/branch-prediction.md


rtl/cpu/bpu/ carries the synthesizable Branch Prediction Unit for the Eliza E1 application processor. The BPU is decoupled from instruction fetch: a Fetch Target Queue (FTQ) buffers predicted fetch blocks between the BPU and the L1I, so that BPU stages can run ahead of fetch and emit prefetch hints in the style of FDIP and XiangShan Kunminghu.

This document mirrors the contract style of docs/arch/cpu-subsystem.md. It is the externally checkable description of the BPU shape, ISA-visible PMU events, accuracy targets, and blockers. The fail-closed evidence gate is make branch-prediction-check, which writes docs/evidence/cpu_ap/branch-prediction-params.json.

Boundary

rtl/cpu/bpu/bpu_top.sv exposes a structured lookup/resolve interface:

| Direction | Signal | Purpose |
|---|---|---|
| in | lkp_valid, lkp_pc | Drive a single PC into the BPU per cycle. |
| out | pred_valid, pred (bpu_lookup_t) | Aggregated prediction. |
| in | fetch_pop | Fetch dequeue strobe. |
| out | fetch_valid, fetch_entry (ftq_entry_t) | Top of the FTQ. |
| in | resolve (bpu_resolve_t) | Resolver feedback from the back-end. |
| in | csr_re, csr_addr | Read port for the 64-bit PMU counters. |
| out | csr_rdata, pmu_strb | Counter value and event strobes. |

Both the prediction and the resolve buses are timed to a single cycle in the current geometry; the FTQ is the decoupling structure between the BPU and the fetch engine.
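The decoupling role of the FTQ can be illustrated with a small behavioural model. This is only a sketch, not the RTL in rtl/cpu/bpu/ftq.sv: the class name FtqModel, its methods, and the dict used for a fetch block are hypothetical, and the model ignores flush and resolve traffic.

```python
from collections import deque


class FtqModel:
    """Behavioural sketch of a bounded fetch target queue (hypothetical, not the RTL)."""

    def __init__(self, entries=64):
        self.entries = entries      # queue depth
        self.queue = deque()        # oldest prediction sits at the left
        self.fetch_bubbles = 0      # pops attempted while the queue was empty

    @property
    def full(self):
        return len(self.queue) >= self.entries

    @property
    def empty(self):
        return not self.queue

    def push(self, block):
        """BPU side: enqueue a predicted fetch block; returns False when full."""
        if self.full:
            return False
        self.queue.append(block)
        return True

    def pop(self):
        """Fetch side: dequeue the oldest block, or record a bubble when empty."""
        if self.empty:
            self.fetch_bubbles += 1
            return None
        return self.queue.popleft()


ftq = FtqModel(entries=4)
for pc in (0x1000, 0x1020, 0x1040):
    ftq.push({"pc": pc})    # the BPU runs ahead of fetch
first = ftq.pop()           # fetch consumes the oldest prediction
```

Because pushes and pops are independent, the producer can queue several blocks before the consumer asks for one, which is the property that lets BPU stages run ahead of fetch.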

Selected topology

The BPU shape is a scaled XiangShan Kunminghu derivative. Numbers come from rtl/cpu/bpu/bpu_pkg.sv, and the evidence gate enforces them against the values in the "2028 minimum threshold" column. See docs/architecture-optimization/sota-2028/branch-predictors.md for the SOTA rationale.

| Component | Selected | 2028 minimum threshold | Rationale |
|---|---|---|---|
| FETCH_BLOCK_BYTES | 32 | 32 | 16 RVC inst/predict, matches Zen 5 / X925 / Lion Cove. |
| MAX_BR_PER_BLOCK | 2 | 1 | 2 taken/cycle target by 2028, MVP 1-taken acceptable. |
| FTQ_ENTRIES | 64 | 32 | Decouple BPU from fetch, FDIP-compatible. |
| UFTB_ENTRIES | 512 | 256 | Zero-bubble next-line predictor, above KMH 256. |
| FTB_ENTRIES | 2048 | 2048 | BTB replacement, KMH v2 floor. |
| FTB_WAYS | 4 | 4 | Match KMH/X925 set-associative footprint. |
| TAGE_TABLES | 5 | 4 | TAGE-SC-L stack on top of bimodal. |
| TAGE_ENTRIES_TABLE | 4096 | 4096 | CBP-5 floor; KMH-class. |
| TAGE_HIST_LEN | {8, 13, 32, 64, 119} | reach >= 100 | Geometric history. |
| BIM_ENTRIES | 16384 | 8192 | Base bimodal table. |
| SC_TABLES | 4 | 4 | Statistical corrector for low-confidence TAGE. |
| SC_ENTRIES_TABLE | 512 | 512 | Seznec CBP-5 baseline. |
| LOOP_ENTRIES | 64 | 32 | Loop-trip predictor. |
| ITTAGE_TABLES | 5 | 5 | Indirect target predictor. |
| ITTAGE_ENTRIES | {256, 256, 512, 512, 512} | >= 1024 total | Matches KMH-v2. |
| RAS_ARCH_ENTRIES | 32 | 16 | Architectural depth. |
| RAS_SPEC_ENTRIES | 64 | 32 | Speculative depth with overflow counter. |
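A fail-closed threshold check over this table could look roughly like the sketch below. It is not the contents of scripts/check_branch_prediction.py: the function name check_geometry and the dict layout are hypothetical, and it assumes that "reach >= 100" means the longest TAGE history length must be at least 100.

```python
# Selected values and 2028 minimum thresholds, copied from the table above.
SELECTED = {
    "FETCH_BLOCK_BYTES": 32, "MAX_BR_PER_BLOCK": 2, "FTQ_ENTRIES": 64,
    "UFTB_ENTRIES": 512, "FTB_ENTRIES": 2048, "FTB_WAYS": 4,
    "TAGE_TABLES": 5, "TAGE_ENTRIES_TABLE": 4096, "BIM_ENTRIES": 16384,
    "SC_TABLES": 4, "SC_ENTRIES_TABLE": 512, "LOOP_ENTRIES": 64,
    "ITTAGE_TABLES": 5, "RAS_ARCH_ENTRIES": 32, "RAS_SPEC_ENTRIES": 64,
}
MINIMUM = {
    "FETCH_BLOCK_BYTES": 32, "MAX_BR_PER_BLOCK": 1, "FTQ_ENTRIES": 32,
    "UFTB_ENTRIES": 256, "FTB_ENTRIES": 2048, "FTB_WAYS": 4,
    "TAGE_TABLES": 4, "TAGE_ENTRIES_TABLE": 4096, "BIM_ENTRIES": 8192,
    "SC_TABLES": 4, "SC_ENTRIES_TABLE": 512, "LOOP_ENTRIES": 32,
    "ITTAGE_TABLES": 5, "RAS_ARCH_ENTRIES": 16, "RAS_SPEC_ENTRIES": 32,
}
TAGE_HIST_LEN = [8, 13, 32, 64, 119]
ITTAGE_ENTRIES = [256, 256, 512, 512, 512]


def check_geometry(selected, minimum, tage_hist, ittage_entries):
    """Return the names of parameters that fall below their threshold."""
    # A missing parameter counts as 0, so it fails closed.
    failures = [name for name, floor in minimum.items()
                if selected.get(name, 0) < floor]
    if max(tage_hist) < 100:           # assumed meaning of "reach >= 100"
        failures.append("TAGE_HIST_LEN")
    if sum(ittage_entries) < 1024:     # ">= 1024 total"
        failures.append("ITTAGE_ENTRIES")
    return failures


failures = check_geometry(SELECTED, MINIMUM, TAGE_HIST_LEN, ITTAGE_ENTRIES)
print("PASS" if not failures else f"BLOCKED: {failures}")
```

Treating a missing parameter as zero keeps the check fail-closed: an incomplete parameter dump is reported as a failure rather than silently passing.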

PMU events (Zihpm)

bpu_pkg::pmu_event_e is the canonical event encoding. The BPU exports pmu_strb and csr_rdata for SoC-level Zihpm integration; the CSR counter index of each event is its enum value.

The BPU enum is ordered so the mapping into zihpm_pkg::hpm_event_e is a pure +1 offset (zihpm reserves id 0 for the "no event" sentinel). The translation is bpu_pkg::bpu_pmu_to_hpm(pmu_id) = pmu_id + 1.

| BPU id | Event | zihpm id | zihpm enum | Description |
|---|---|---|---|---|
| 0 | PMU_BR_PRED | 1 | EVT_BR_PRED | Total predictions emitted. |
| 1 | PMU_BR_TAKEN | 2 | EVT_BR_TAKEN | Predictions where the direction was taken. |
| 2 | PMU_BR_MISP | 3 | EVT_BR_MISP | Mispredictions reported by the resolver. |
| 3 | PMU_BR_COND | 4 | EVT_BR_COND | Conditional branches predicted. |
| 4 | PMU_BR_COND_MISP | 5 | EVT_BR_COND_MISP | Conditional branch mispredictions. |
| 5 | PMU_BR_IND | 6 | EVT_BR_IND | Indirect branches predicted. |
| 6 | PMU_BR_IND_MISP | 7 | EVT_BR_IND_MISP | Indirect branch mispredictions. |
| 7 | PMU_BR_CALL | 8 | EVT_BR_CALL | Call predictions. |
| 8 | PMU_BR_RET | 9 | EVT_BR_RET | Return predictions. |
| 9 | PMU_BR_RET_MISP | 10 | EVT_BR_RET_MISP | Return mispredictions. |
| 10 | PMU_RAS_OVERFLOW | 11 | EVT_RAS_OVERFLOW | RAS push into a full speculative stack. |
| 11 | PMU_RAS_UNDERFLOW | 12 | EVT_RAS_UNDERFLOW | RAS pop from an empty speculative stack. |
| 12 | PMU_FTQ_FULL | 13 | EVT_FTQ_FULL | FTQ full strobe. |
| 13 | PMU_FTQ_EMPTY | 14 | EVT_FTQ_EMPTY | FTQ empty strobe. |
| 14 | PMU_FETCH_BUBBLE | 15 | EVT_FETCH_BUBBLE | Fetch popped while FTQ was empty. |
| 15 | PMU_FTB_MISS | 16 | EVT_BTB_MISS | FTB read missed. |
| 16 | PMU_UFTB_HIT | 17 | EVT_UFTB_HIT | uFTB read hit. |
| 17 | PMU_TAGE_ALLOC | 18 | EVT_TAGE_ALLOC | TAGE allocated a new entry. |
| 18 | PMU_LOOP_HIT | 19 | EVT_LOOP_HIT | Loop predictor produced a high-confidence prediction. |
| 19 | PMU_SC_OVERRIDE | 20 | EVT_SC_OVERRIDE | SC overrode TAGE on a low-confidence prediction. |

These are visible to Linux perf via Zihpm event selectors documented in docs/evidence/cpu_ap/branch-prediction-params.json. The OoO cluster wires the BPU's pmu_strb bit i onto its event bus at position bpu_pmu_to_hpm(i) so the Zihpm counters see exactly one strobe per BPU event firing, with no further translation logic in the integration top.
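The wiring rule can be modelled in a few lines. The sketch below is a software illustration, not the logic in rtl/cpu/csr/bpu_to_zihpm_remap.sv: remap_strobes is a hypothetical name, and the strobe vectors are plain integers with bit i standing for event i. The widths (20 and 256) and the +1 rule come from this document.

```python
PMU_EVENTS = 20        # width of pmu_strb
HPM_BUS_WIDTH = 256    # width of the zihpm event bus


def bpu_pmu_to_hpm(pmu_id):
    """Mirror of the documented rule: zihpm id = BPU id + 1 (id 0 is 'no event')."""
    return pmu_id + 1


def remap_strobes(pmu_strb):
    """Place each BPU strobe bit i at event-bus position bpu_pmu_to_hpm(i)."""
    if not 0 <= pmu_strb < (1 << PMU_EVENTS):
        raise ValueError("pmu_strb must fit in PMU_EVENTS bits")
    bus = 0
    for i in range(PMU_EVENTS):
        if (pmu_strb >> i) & 1:
            bus |= 1 << bpu_pmu_to_hpm(i)
    return bus


# PMU_BR_MISP (id 2) and PMU_FTB_MISS (id 15) fire in the same cycle.
strb = (1 << 2) | (1 << 15)
bus = remap_strobes(strb)   # bits 3 (EVT_BR_MISP) and 16 (EVT_BTB_MISS) are set
```

Since the offset is constant, the whole remap is equivalent to a left shift by one, which is why no translation table is needed in the integration top.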

Accuracy targets

| Workload | MPKI ceiling | Status |
|---|---|---|
| TAGE-SC-L on CBP-5 synthetic trace | <= 4.5 | local cocotb harness in benchmarks/cpu/branch/. |
| SPECint2017 intrate, geomean | <= 4.0 | BLOCKED: requires SPEC license + cycle-accurate gem5-XiangShan. |
| Geekbench 6 navigation | <= 6 | BLOCKED: closed benchmark. |
| Android UI (AOSP, ART/JIT) | <= 5 | BLOCKED: requires AsmDB/simpleperf trace ingestion. |
| Android cold-launch (Chrome/YouTube) | <= 8 | BLOCKED: requires AOSP system trace. |
| Linux kernel mix | <= 4 | BLOCKED: requires simpleperf capture on Linux-capable AP boot. |
| V8 JetStream2 indirect dispatch | <= 4% indirect misp | BLOCKED: requires JS-engine trace. |

The local cocotb MPKI harness (benchmarks/cpu/branch/run_mpki.py) measures the BPU against synthetic and trace-replay workloads. Real-workload numbers remain BLOCKED until SPEC/AOSP/JS evidence is in place.
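For reference, MPKI is mispredictions per thousand retired instructions. The sketch below shows the arithmetic and a ceiling comparison; it is not taken from run_mpki.py, the function names are hypothetical, and the counts in the example are made-up illustration values rather than measured results.

```python
def mpki(mispredictions, instructions):
    """Mispredictions per kilo-instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * mispredictions / instructions


def meets_ceiling(mispredictions, instructions, ceiling):
    """True when the measured MPKI is at or below the target ceiling."""
    return mpki(mispredictions, instructions) <= ceiling


# Illustrative numbers only: 4,200 mispredictions over 1,000,000 instructions.
example = mpki(4_200, 1_000_000)                  # about 4.2 MPKI
ok = meets_ceiling(4_200, 1_000_000, ceiling=4.5)
```

With these illustrative counts the run would sit under the 4.5 ceiling listed for the CBP-5 synthetic trace.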

Cross-domain contracts

Two interfaces leave the BPU domain.

PMU → Zihpm

The BPU emits per-cycle event strobes on pmu_strb[PMU_EVENTS-1:0], one bit per event. The OoO domain consumes them through rtl/cpu/csr/bpu_to_zihpm_remap.sv, which lands each strobe in its Zihpm event-bus slot. The BPU enum is locked so the mapping is a pure +1 offset, with the helper bpu_pkg::bpu_pmu_to_hpm() encoding the rule.

| BPU side | OoO side |
|---|---|
| rtl/cpu/bpu/bpu_pkg.sv (pmu_event_e, bpu_pmu_to_hpm()) | rtl/cpu/csr/zihpm.sv (hpm_event_e) |
| 20-bit pmu_strb from bpu_top.pmu_strb | 256-bit zihpm event bus driven by bpu_to_zihpm_remap |
| Counter readout: csr_addr 0..19 → 64-bit counter | OS-visible Zihpm CSRs mhpmcounter3..15 |

Coordination evidence is produced by scripts/check_pmu_event_alignment.py (writes docs/evidence/cpu_ap/pmu-event-alignment.json).
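The alignment property being checked is simple to state: every zihpm id must equal the BPU id plus one. A minimal sketch of such a check follows. It does not reproduce scripts/check_pmu_event_alignment.py; the row layout and the name misaligned_rows are hypothetical, and only a handful of rows from the event table are included.

```python
# (BPU enum name, BPU id, zihpm enum name, zihpm id): a subset of the event table.
ROWS = [
    ("PMU_BR_PRED", 0, "EVT_BR_PRED", 1),
    ("PMU_BR_MISP", 2, "EVT_BR_MISP", 3),
    ("PMU_FTB_MISS", 15, "EVT_BTB_MISS", 16),
    ("PMU_SC_OVERRIDE", 19, "EVT_SC_OVERRIDE", 20),
]


def misaligned_rows(rows, offset=1):
    """Return every row whose zihpm id is not BPU id + offset."""
    return [row for row in rows if row[3] != row[1] + offset]


bad = misaligned_rows(ROWS)
status = "PASS" if not bad else "BLOCKED"
```

The check compares ids only, not names, so a pair such as PMU_FTB_MISS and EVT_BTB_MISS passes as long as the ids stay one apart.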

FTQ → L1I

The BPU writes predicted fetch blocks into the FTQ and emits a downstream prefetch request via rtl/cpu/bpu/ftq_to_l1i_shim.sv. The cache domain consumes e1_ftq_to_l1i_pkg::ftq_prefetch_req_t (40-bit physical line + 3-bit confidence + branch-target hint) on a single-cycle valid/ready handshake with a separate flush strobe for misprediction recovery.

The shim performs three translations:

  1. 39-bit Sv39 virtual target_pc → 40-bit physical line address (assumes identity V→P at this stage; real translation lives on the cache side).
  2. br_kind_e → 3-bit confidence (BR_NONE=0, BR_COND=4, BR_CALL=5, BR_RET=6).
  3. branch_target = fetch_entry.taken.
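The three translations above can be sketched as a small function. This is an illustration only, not the logic of ftq_to_l1i_shim.sv. The names BrKind, translate, and the returned dict keys are hypothetical; the 64-byte line size and zero-extension of the virtual address are assumptions; and only the four branch kinds named above are mapped, so any other kind raises an error rather than guessing a confidence value.

```python
from enum import Enum

# Only the branch kinds named in the text; other br_kind_e members are not modelled.
BrKind = Enum("BrKind", "BR_NONE BR_COND BR_CALL BR_RET")

CONFIDENCE = {
    BrKind.BR_NONE: 0,
    BrKind.BR_COND: 4,
    BrKind.BR_CALL: 5,
    BrKind.BR_RET: 6,
}

LINE_BYTES = 64   # assumption: cache line size is not stated in this document
VA_BITS = 39      # Sv39 virtual address width


def translate(target_pc, kind, taken):
    """Build a prefetch request from one FTQ entry (identity V->P assumed)."""
    if not 0 <= target_pc < (1 << VA_BITS):
        raise ValueError("target_pc must fit in 39 bits")
    # Identity mapping: the 39-bit address is zero-extended into the 40-bit
    # physical field (assumption), then aligned down to a line boundary.
    line_addr = target_pc & ~(LINE_BYTES - 1)
    return {
        "line_addr": line_addr,
        "confidence": CONFIDENCE[kind],   # KeyError for unmapped kinds
        "branch_target": bool(taken),
    }


req = translate(0x8000_1234, BrKind.BR_RET, taken=True)
```

In this model a predicted return to 0x8000_1234 yields a request for line 0x8000_1200 with confidence 6 and the branch-target hint set.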

The cluster top (rtl/cpu/cluster/e1_cluster_top.sv) wires the shim between bpu_top.fetch_entry and the cache domain.

Blockers

  1. XiangShan upstream licensing — Mulan PSL v2; resolved by adoption, tracked via generators/xiangshan/eliza-kunminghu-manifest.json (BPU IP pin) and generators/chipyard/eliza-kunminghu-manifest.json (whole-core selection, owned by the OoO domain).
  2. Two-taken-per-cycle — current geometry parameterises MAX_BR_PER_BLOCK = 2 but the prediction pipeline only emits one taken branch per cycle. Lifting this to two requires a dual-port FTB read path and a non-contiguous fetch contract.
  3. L1I prefetch path: ftq_to_l1i_shim lands the prefetch request on the cache agent's interface, but the cache-side prefetch engine and the iTLB-on-receive translation remain in the cache domain.
  4. Real-trace MPKI evidence — see Accuracy targets.
  5. Verilator/Yosys/SBY hosting — the chip package has historically relied on Docker/Nix shells for these tools; the local oss-cad-suite checkout under external/oss-cad-suite/ resolves them. make bpu-lint, make cocotb-bpu, and make formal-bpu fail closed with STATUS: BLOCKED when the suite is missing.
  6. Formal coverage for the FTQ and RAS — yosys 0.64 (oss-cad-suite) does not accept struct typedefs in module port lists, and its async-reset handling lets the BMC pick arbitrary initial values for reset-driven flops. Both formal harnesses fail closed with named yosys limitations and the cocotb regression (33/33 across 9 modules) carries functional coverage in the interim.

Verification surface

| Gate | Command | Output |
|---|---|---|
| Parameter geometry | make branch-prediction-check | docs/evidence/cpu_ap/branch-prediction-params.json |
| Cross-domain PMU IDs | make pmu-event-alignment-check | docs/evidence/cpu_ap/pmu-event-alignment.json |
| Verilator strict lint | make bpu-lint | build/reports/bpu/lint-status.yaml |
| Cocotb regression | make cocotb-bpu | verify/cocotb/bpu/results/*.xml |
| SymbiYosys formal | make formal-bpu | build/reports/bpu/formal-status.yaml |
| MPKI eval (RTL, cocotb) | make mpki-eval-rtl | docs/evidence/cpu_ap/mpki_results_synthetic.json |
| MPKI eval (model only) | make mpki-eval-model | benchmarks/results/branch-prediction-mpki-model.json |
| MPKI vs CBP-5 table | | docs/evidence/cpu_ap/mpki_synthetic_vs_cbp5_reference.md |

Files

  • rtl/cpu/bpu/bpu_pkg.sv — parameter and type package.
  • rtl/cpu/bpu/bimodal.sv, tage_table.sv, tage.sv — TAGE direction.
  • rtl/cpu/bpu/sc.sv — statistical corrector.
  • rtl/cpu/bpu/loop_predictor.sv — loop predictor.
  • rtl/cpu/bpu/ittage.sv — indirect target predictor.
  • rtl/cpu/bpu/ftb.sv, uftb.sv — fetch target buffer + zero-bubble buddy.
  • rtl/cpu/bpu/ras.sv — return address stack.
  • rtl/cpu/bpu/ftq.sv — fetch target queue.
  • rtl/cpu/bpu/bpu_csr.sv — PMU counters and useful-bit reset.
  • rtl/cpu/bpu/bpu_top.sv — integration top.
  • rtl/cpu/bpu/ftq_to_l1i_shim.sv — translation to the cache domain's L1I-prefetch interface.
  • verify/cocotb/bpu/ — cocotb unit and integration tests (9 wrappers / 33 tests).
  • verify/formal/bpu/ — SymbiYosys formal harnesses.
  • benchmarks/cpu/branch/ — MPKI harness and synthetic traces (8 synthetic generators).
  • generators/xiangshan/eliza-kunminghu-manifest.json — BPU IP-pin manifest.
  • docs/generators/xiangshan/eliza-kunminghu-manifest.json — historical manifest predating the IP-pin/whole-core split; both files are kept in lockstep via scripts/check_branch_prediction.py.
  • docs/evidence/cpu_ap/branch-prediction-params.json — evidence emitted by scripts/check_branch_prediction.py.
  • docs/evidence/cpu_ap/pmu-event-alignment.json — cross-domain PMU alignment evidence.