packages/chip/docs/architecture-optimization/sota-2028/compiler-tuning.md
Sub-report of 2028-sota-integrated-report.md.
RVV 1.0 codegen is real but uneven. RVV 1.0 ratified 2021. LLVM treats it as fully supported but autovec quality is still maturing.
`-menable-experimental-extensions`. Zfh, Zicboz, Zicbom, Zihintntl, Ztso, Zacas land in LLVM 19-23 progressively.

In 2026 LLVM is the canonical RVV target. Autovec is good enough as a baseline, but the gap to vendor-tuned intrinsics on hot kernels is still 1.5x-3x on stride/predication-heavy code. Strategy: autovec everywhere, then hand-tuned intrinsics on top-N kernels.
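A minimal sketch of how the top-N list could be chosen from profile data. The function name `top_n_kernels` and the sample counts are hypothetical; nothing like this exists in the repo today, and the real input would be whatever the profiler emits.

```python
from typing import Dict, List, Tuple


def top_n_kernels(samples: Dict[str, int], n: int) -> List[Tuple[str, float]]:
    """Rank kernels by profile sample count and return the n hottest.

    Each entry is (kernel name, share of all samples). Ties are broken by
    name so the ranking is stable between runs.
    """
    total = sum(samples.values())
    if total <= 0:
        raise ValueError("profile contains no samples")
    ranked = sorted(samples.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, count / total) for name, count in ranked[:n]]


if __name__ == "__main__":
    # Hypothetical sample counts, for illustration only.
    profile = {"memcpy": 50, "sgemm": 300, "conv": 150}
    for name, share in top_n_kernels(profile, 2):
        print(f"{name}: {share:.0%} of samples")
```

Kernels outside the list stay on autovectorized code; only the listed ones become candidates for hand-written intrinsics.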
Composable: stacked PGO+ThinLTO+Propeller+BOLT realistically delivers 12-18% on a system image (10% AutoFDO+Propeller, 2-6% BOLT, ~2% MFS). Linux 6.19 RISC-V Spectre mitigations cost 5-10% in tight loops, so net win for security-on builds ~5-10%.
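The net figure can be sanity-checked by composing speed factors multiplicatively. This is a rough sketch that assumes the gain and the mitigation cost apply uniformly to the same code, which the tight-loop figure does not guarantee; the helper names are illustrative.

```python
def compose(*factors: float) -> float:
    """Multiply speed factors (1.12 means 12% faster, 0.90 means 10% slower)."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


def net_gain_pct(gain_pct: float, cost_pct: float) -> float:
    """Net percent change when a gain and a cost apply to the same code."""
    return (compose(1 + gain_pct / 100, 1 - cost_pct / 100) - 1) * 100


if __name__ == "__main__":
    # Figures quoted in the text: 12-18% stacked gain, 5-10% mitigation cost.
    worst = net_gain_pct(12, 10)  # smallest gain paired with largest cost
    best = net_gain_pct(18, 5)    # largest gain paired with smallest cost
    print(f"net range: {worst:.1f}% to {best:.1f}%")
```

Under this model the extreme pairings bracket the ~5-10% estimate. Because the mitigation cost is quoted for tight loops, the averaged cost on a full image is expected to be lower than the worst pairing suggests.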
`RISCV_HWPROBE_IMA_V` flag indicates RVV 1.0; `fence.i` may go through kernel-managed vDSO.

| Stack | License | Hardware coverage | LLM/INT4 | Open compiler? |
|---|---|---|---|---|
| Qualcomm QNN (Hexagon HVX/HMX) | Closed | Hexagon NPU, CPU, GPU, LPAI | INT4 weight-only via QNN profiles | No; LiteRT has direct QNN delegate |
| MediaTek NeuroPilot | Closed (SDK gated) | Dimensity NPU | INT4 via Compiled Model API; Google LiteRT NeuroPilot stack Dec 2025 | No |
| Apple Core ML / ANE / SME2 | Closed | ANE (SME2 underneath per Orion paper), GPU | int8 arrays direct on iOS/macOS 26+ | No; Orion reverse-engineered private _ANECompiler |
| Google IREE/MLIR | Apache-2.0 | CPU+GPU+Vulkan+TFLite; Synaptics SL2610 Torq NPU, AMD via MLPerf SDXL Apr 2025, Coral NPU | Via StableHLO/MLIR lowerings | Yes |
| Apache TVM (Ansor/AutoTVM) | Apache-2.0 | Wide; MediaTek paper combined TVM+NeuroPilot | Yes via Relax | Yes |
| PyTorch ExecuTorch | BSD | Apple, Qualcomm, Arm, MediaTek, Vulkan, XNNPACK CPU | INT4 PT2E quantization | Yes; ships in AOSP external tree |
| LiteRT (formerly TFLite) | Apache-2.0 | CPU (XNNPACK), GPU, NPU (QNN, NeuroPilot), Coral | INT2/INT4 in TF 2.21 (Mar 2026) | Frontend yes; backends mixed |
For an open RISC-V chip with no closed-vendor SDK to lean on, IREE + ExecuTorch + XNNPACK is the only sane open stack. LiteRT 2.21 reports NPU speedups of up to 100x over CPU and 10x over GPU on supported delegates; we'd need a custom IREE backend to claim anything similar.
`lpad` (AUIPC opcode, rd=x0). Compiler emits `lpad` on address-taken functions and indirect-branch targets.

`sspush`/`sspopchk`/`ssrdp`/`ssamoswap`.

Three competing proposals at RISC-V International:
AME data-type vote was recalled Dec 2025 for architectural pivot. No matrix extension will be ratified in time for 2028 ship. Matrix lives in NPU, not CPU, for at least one more product cycle. Arm SME2/Apple ANE is years ahead.
`packages/chip`

Evidence from these files:
- `compiler/runtime/e1_npu_runtime.py`: 660-line Python MMIO contract enforcer. Implements scalar `ADD`/`SUB`/`MUL_LO`/`MAC_S16`/`DOT4_S8`/`DOT8_S4`/`DOT16_S2`/`DOT4_FP8_E4M3`, packed `RELU4_S8`/`VRELU_S8`, sparse `SDOT4_S4_2_4`, bounded `GEMM_S8`/`GEMM_S4` with M≤3, N≤3, K≤7 inside 64-byte scratchpad. 4-word descriptor ring with `valid_owner`, `writeback_request`, byte-count, scratch-offset packing.
- `docs/spec-db/e1-npu-runtime-contract.json`: schema `eliza.e1_npu_runtime_contract.v1`. Self-classifies as L0 RTL UNIT, prototype only, explicitly disclaims NNAPI/AIDL/phone-class TOPS/production DMA/sustained perf/MLIR-StableHLO-TFLite-ExecuTorch compiler path.
- `docs/arch/npu.md`: entire opcode ABI is single-cycle MMIO write/poll. `compiler/runtime/e1_npu_lowering.py` provides single-op lowering "smoke" for `stablehlo.dot_general`, `tflite.fully_connected`, `tflite.conv_2d`, attention-QK, attention-AV, MLP, bias-add, residual-add, transformer-block. Host-side tiling stitches multiple 3x3x7 GEMM tiles. Host does im2col, transpose, requantize. No parser, no scheduler, no graph partitioner, no delegate, no quantization calibration.
- `docs/arch/npu-microarch.md`: planned v0: Chipyard-default Gemmini (16×16 INT8) wrapped through MMIO with 64-byte descriptor ring, `0x1002_0000` window. Implemented today: scalar fallback (`e1_npu.sv`); Gemmini wrapper "to be added".
- `docs/toolchain/riscv64-cross-host.md`: host has only `riscv64-elf-gcc` 16.1.0 + `riscv64-elf-binutils` + QEMU 11.0.0. No glibc cross. No LLVM/Clang for RISC-V. No ART. No NDK. No AOSP toolchain.
- `docs/architecture-optimization/software-ci.md`: 25 lines; says benchmark/AOSP/firmware claims must have real tool execution and fail-closed gates. No compiler tuning section exists.
- `docs/toolchain/benchmark-simulator-critical-gap-audit.md`: CoreMark/STREAM/lmbench/fio/TFLite `benchmark_model`/MLPerf Mobile all `planned_missing_deps` or blocked. Docker base not pinned. No `flake.lock`. No ELF hash archive. No PGO/AutoFDO/Propeller/BOLT tooling listed at all.
- `docs/npu/2028-targets.md`: target (160 TOPS dense INT8 peak, 80 sustained, 512 sparse INT4 peak, 18 TOPS/W). Software direction names "AIDL HAL, TFLite delegate, NNAPI, StableHLO MLIR, IREE or TVM, ExecuTorch". All future; none implemented.
- `grep -r "LLVM|AutoFDO|Propeller|BOLT|ThinLTO|MLIR|IREE|TVM|ExecuTorch|PGO|LTO"` across all docs: 4 hits: benchmark matrix (ExecuTorch as Samsung Exynos competitor signal), 2028-targets aspirational list, chipyard/circt toolchain note, MediaTek competitor reference. No checked-in evidence of any of these in actual build/CI flow. No PGO profile capture script. No Propeller integration. No ThinLTO toggle. No AutoFDO collection harness. No MLIR/IREE backend dialect. No ExecuTorch lowering. No baseline profile workflow.
State: a Python MMIO contract enforcer plus aspirational competitor citations, with no compiler tuning evidence in the repo.
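The host-side tiling described in the evidence list (larger GEMMs stitched from tiles bounded by M≤3, N≤3, K≤7) can be sketched as follows. This is an illustration of the technique, not the code in `e1_npu_lowering.py`: `gemm_tile_s8` is a hypothetical pure-Python stand-in for one bounded `GEMM_S8` submission, and scratchpad layout, descriptors, and requantization are left out.

```python
from typing import List, Tuple

Matrix = List[List[int]]

# Tile bounds quoted for the bounded GEMM ops: M <= 3, N <= 3, K <= 7.
TILE_M, TILE_N, TILE_K = 3, 3, 7


def gemm_tile_s8(a: Matrix, b: Matrix) -> Matrix:
    """Hypothetical stand-in for one bounded GEMM_S8 submission.

    Rejects shapes outside the tile bounds and operands outside int8 range.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    if len(b) != k or m > TILE_M or n > TILE_N or k > TILE_K:
        raise ValueError(f"tile {m}x{k} @ {len(b)}x{n} is outside the bounds")
    if any(not -128 <= v <= 127 for row in a + b for v in row):
        raise ValueError("operand outside int8 range")
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def tiled_gemm_s8(a: Matrix, b: Matrix) -> Tuple[Matrix, int]:
    """Compute a @ b from bounded tiles; returns (result, tile submissions)."""
    m, k, n = len(a), len(a[0]), len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    c = [[0] * n for _ in range(m)]
    submissions = 0
    for i0 in range(0, m, TILE_M):
        for j0 in range(0, n, TILE_N):
            for k0 in range(0, k, TILE_K):
                a_tile = [row[k0:k0 + TILE_K] for row in a[i0:i0 + TILE_M]]
                b_tile = [row[j0:j0 + TILE_N] for row in b[k0:k0 + TILE_K]]
                part = gemm_tile_s8(a_tile, b_tile)
                submissions += 1
                # The host accumulates partial sums across the K dimension.
                for di, part_row in enumerate(part):
                    for dj, value in enumerate(part_row):
                        c[i0 + di][j0 + dj] += value
    return c, submissions
```

Even a 4x9 by 9x5 product needs 2 x 2 x 2 = 8 tile submissions, each with host-side slicing and accumulation, which is why this approach is only suitable as a smoke test or a unit-test oracle.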
`e1_npu` contract is right scaffold; missing piece is proper MLIR dialect (`eliza_npu`) and IREE backend emitting descriptors instead of MMIO writes one at a time.

- `-O3 -mcpu=eliza-e1 -march=rva23u64 -mtune=eliza-e1 -fvectorize -flto=thin -fprofile-sample-use=<autofdo> -fbasic-block-sections=labels` then Propeller relink then BOLT on final image.
- `-fcf-protection=full` (Zicfilp/Zicfiss), `-fstack-clash-protection`, `-fsanitize=shadow-call-stack` for tagged components, `-fstack-protector-strong`. Expect 5-10% in worst-case loops, much less averaged.
- `-fexperimental-relative-c++-abi-vtables` for framework: saves a few MB in system images.
- `elizanpu` dialect under `compiler/iree-eliza-npu/` lowering `linalg.matmul`, `linalg.conv_2d_nhwc_hwio`, attention, softmax, layer-norm, gelu/swiglu into descriptors feeding existing `submit_descriptors` ABI. Current Python `e1_npu_lowering.py` is throwaway prototype.
- `SDOT4_S4_2_4` opcode. Wire into compiler's sparsity pass.
- `DOT16_S2`; matches Snapdragon 8 Elite Gen 5 INT2 + FP8.
- `build/soong/cc/lto.go`. Off by default; turn it on for our build.
- `-fprofile-sample-use`.
- `system_server`, `surfaceflinger`, `zygote64`, `mediaserver`, `webview`, `chrome`. Even 2-6% on top of FDO+LTO is meaningful for power.
- `eliza-e1` per-mcpu cost model. Target geomean 5%+ above stock RVA23 baseline.
- `planned_missing_deps`. Pin CoreMark/STREAM/lmbench/fio build recipe.
- `-Os` / `-O2` / `-O3`+AutoFDO matrix. Boot time and `Activity#onCreate` → first frame are headline numbers vendors publish.
- `unsupported_ops`, `cpu_fallback_pct`. Target hard <1% `cpu_fallback` for published model list.
- `docs/npu/2028-targets.md`).
- `e1_npu_runtime.py`) with bounded `GEMM_S8`/`S4`, scalar/packed dots, packed ReLU, sparse INT4 dot, scalar FP8 E4M3, descriptor ring with `valid_owner`/byte-count/scratch-offset.
- `e1_npu_lowering.py`) for matmul/conv2d/attention-QK/attention-AV/MLP/bias-add/residual-add/transformer-block over StableHLO/TFLite records (not real IR).
- `riscv64-elf-gcc` 16.1.0 + QEMU 11.
- `L0_RTL_UNIT`) with strict claim boundaries.
- `-mcpu=eliza-e1`, RVV intrinsics headers, ThinLTO, sample-PGO, basic-block-sections; checked-in build recipe pinning LLVM SHA.
- `elizanpu` MLIR dialect) consuming StableHLO/linalg and emitting descriptor ring. Replaces Python smoke with true tensor compiler.
- `.prof` files per workload class.
- `e1_npu_lowering.py` cannot scale; needs MLIR/IREE.
- `manifest.xml` SHA: Google's removal of RISC-V from AOSP common kernel in 2024 needs answering with pinned branch.
- `lower_attention_qk_smoke` is exactly host-side fix-up that must disappear. Keep Python as unit test oracle only; build MLIR/IREE for real codegen.
- `docs/toolchain/riscv64-cross-host.md`: Homebrew has no `riscv64-linux-gnu-gcc` on darwin-arm64. Userspace can only be built in Linux container. Mandate Linux container (Docker/OrbStack) as canonical compiler env.
- `benchmark-simulator-critical-gap-audit.md`: Docker base moving tag, no `flake.lock`, no LLVM SHA, no `mobile_smoke.tflite` checksum. Compiler tuning not reproducible until pinned.
- `DOT4_FP8_E4M3` and `DOT16_S2` are scalar-only. To hit 2028 target (80 TFLOPS FP8 peak, 900 TOPS INT2 peak) chip must add full tensor execution and compiler must lower into them. Today neither exists.
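The `unsupported_ops` and `cpu_fallback_pct` fields and the <1% target named above could be computed and gated roughly like this. It is a sketch under stated assumptions: fallback is measured by op count (the real metric might weight by runtime or MACs), and `FallbackReport`, `fallback_report`, and `passes_gate` are hypothetical names, not existing tooling.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass
class FallbackReport:
    unsupported_ops: Dict[str, int]  # op name -> occurrences left on the CPU
    cpu_fallback_pct: float          # share of ops falling back, by count


def fallback_report(model_ops: Iterable[str],
                    npu_supported: Set[str]) -> FallbackReport:
    """Summarize which ops of one model the NPU backend cannot take."""
    ops = list(model_ops)
    if not ops:
        raise ValueError("model has no ops")
    missing = Counter(op for op in ops if op not in npu_supported)
    pct = 100.0 * sum(missing.values()) / len(ops)
    return FallbackReport(dict(missing), pct)


def passes_gate(report: FallbackReport, limit_pct: float = 1.0) -> bool:
    """Fail closed: the model passes only if fallback is strictly below the limit."""
    return report.cpu_fallback_pct < limit_pct


if __name__ == "__main__":
    # Hypothetical op list: 198 matmuls, one softmax, one gelu.
    ops = ["matmul"] * 198 + ["softmax", "gelu"]
    report = fallback_report(ops, npu_supported={"matmul", "softmax"})
    print(report.unsupported_ops, report.cpu_fallback_pct, passes_gate(report))
```

Running the gate per model in the published list, and failing CI when any model misses it, keeps the claim tied to real tool execution.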