2028 NPU Target

packages/chip/docs/npu/2028-targets.md

This is the Eliza performance target for a best-in-class 2028 Android phone NPU. It is set intentionally above what the current e1_npu RTL can deliver; that RTL remains an L0 unit demonstrator. The target steers architecture, verification, compiler, Android HAL, and benchmark work without pretending the current repo has phone-class silicon.

Current Public Signals

Public state-of-the-art announcements as of 2026 point in a clear direction:

| Anchor | Public signal | Design consequence |
| --- | --- | --- |
| Qualcomm Snapdragon 8 Elite Gen 5 | Hexagon NPU is advertised as 37% faster, 16% better performance per watt, with INT2 and FP8 support. | Low precision and mixed precision must be first-class, not a later extension. |
| MediaTek Dimensity 9500 | NPU 990 claims up to 56% lower peak power, over 2x token-generation speed, and a CIM-based efficient NPU. | Data movement dominates. SRAM locality, compression, sparsity, and always-on efficiency matter as much as MAC count. |
| Samsung Exynos 2600 | NPU claims 113% better generative-AI performance with lower latency and power plus ExecuTorch support. | The software stack must target real model deployment paths, not only synthetic GEMM. |
| Qualcomm Snapdragon X2 family | Laptop-class integrated NPU listings show 80 TOPS and 152 GB/s LPDDR5x bandwidth. | A large-battery phone should target laptop-adjacent burst AI while enforcing mobile sustained-power gates. |
| Apple A19 Pro | A 16-core Neural Engine is paired with GPU Neural Accelerators. | The AP should expose cooperative GPU/NPU scheduling for graphics-plus-AI workloads. |

Sources are recorded in docs/spec-db/npu-2028-target.yaml.

Numeric Target

The target is not a marketing TOPS target. Every TOPS figure must be reported together with its precision, sparsity, thermal state, clock, power, memory bandwidth, and CPU fallback percentage.

| Metric | 2028 target |
| --- | --- |
| Dense INT8 peak | at least 160 TOPS |
| Dense INT8 sustained | at least 80 TOPS |
| Sparse INT4 peak | at least 512 TOPS |
| Sparse INT4 sustained | at least 200 TOPS |
| INT2 / BitNet-class peak | at least 900 TOPS |
| FP8 peak | at least 80 TFLOPS |
| Sustained INT8 efficiency | at least 18 TOPS/W |
| NPU burst power | no more than 8 W |
| NPU sustained power | no more than 4.5 W |
| Local SRAM | at least 64 MiB |
| Local SRAM bandwidth | at least 20 TB/s aggregate |
| Shared system cache | at least 32 MiB |
| External memory bandwidth | at least 180 GB/s |
| CPU fallback | no more than 1% of measured graph nodes |
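
The sustained rows interact: throughput, power, and efficiency are three separate gates, and meeting the first two at their limits does not by itself meet the third. The sketch below checks a measured report against the sustained INT8 rows and the CPU fallback row. The `SustainedReport` type, its field names, and `check_sustained` are hypothetical and are not taken from the repo's scripts; only the thresholds come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SustainedReport:
    """Hypothetical container for one sustained-run measurement."""
    int8_tops: float          # sustained dense INT8 throughput, in TOPS
    power_w: float            # sustained NPU power, in watts
    cpu_fallback_nodes: int   # graph nodes that ran on the CPU
    total_nodes: int          # all measured graph nodes


def check_sustained(r: SustainedReport) -> dict[str, bool]:
    """Return one pass/fail flag per sustained row of the target table."""
    efficiency = r.int8_tops / r.power_w
    fallback_pct = 100.0 * r.cpu_fallback_nodes / r.total_nodes
    return {
        "int8_sustained_tops": r.int8_tops >= 80.0,   # at least 80 TOPS
        "sustained_power_w": r.power_w <= 4.5,        # no more than 4.5 W
        "int8_tops_per_w": efficiency >= 18.0,        # at least 18 TOPS/W
        "cpu_fallback_pct": fallback_pct <= 1.0,      # no more than 1%
    }


# 80 TOPS at 4.5 W sits exactly on the throughput and power limits,
# but 80 / 4.5 is about 17.8 TOPS/W, so the efficiency gate fails.
edge = check_sustained(SustainedReport(80.0, 4.5, 3, 400))
# 90 TOPS at 4.5 W is 20 TOPS/W and passes every gate.
good = check_sustained(SustainedReport(90.0, 4.5, 0, 400))
```

Reading the rows together, a design running at the 4.5 W sustained limit needs at least 81 TOPS to clear the 18 TOPS/W gate.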

Architecture Direction

The 2028 NPU should be a tiled matrix/vector accelerator:

  • 8 to 16 compute tiles.
  • At least 4096 INT8 MAC units per tile, with INT4/INT2 packing paths.
  • At least 4 MiB local SRAM per tile.
  • Separate systolic matrix, vector activation, layout-transform, DMA, sparsity decode, and scalar-control engines.
  • IOMMU-isolated command buffers, deep queues, per-context fault isolation, and cache-coherent CPU submission.
  • Hardware support for transformer decode, prefill, convolution, camera AI, image generation, and always-on micro-NPU paths.
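
The tile count and per-tile MAC floor above fix how many operations the array can issue per cycle, so the dense INT8 peak target implies a minimum clock. The sketch below is only that arithmetic, counting each MAC as two operations. The function names are illustrative, and the document does not state a target clock.

```python
def dense_int8_peak_tops(tiles: int, macs_per_tile: int, clock_hz: float) -> float:
    """Peak TOPS, counting each MAC as two operations (multiply and add)."""
    return tiles * macs_per_tile * 2 * clock_hz / 1e12


def min_clock_hz(tiles: int, macs_per_tile: int, target_tops: float = 160.0) -> float:
    """Lowest clock at which the given array reaches target_tops."""
    return target_tops * 1e12 / (tiles * macs_per_tile * 2)


# Both ends of the "8 to 16 tiles" range at the 4096-MAC floor.
clock_16_tiles = min_clock_hz(16, 4096)
clock_8_tiles = min_clock_hz(8, 4096)

# Local SRAM at the 4 MiB-per-tile floor, in MiB.
sram_16_tiles_mib = 16 * 4
```

At the 4096-MAC floor, 16 tiles need about 1.22 GHz and 8 tiles about 2.44 GHz to reach 160 dense INT8 TOPS. Sixteen tiles at the 4 MiB floor also give exactly the 64 MiB local SRAM target, so an 8-tile design would need more MACs and more SRAM per tile than the stated minimums.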

Software Direction

The NPU is only real when the software stack can use it:

  • AIDL HAL and fail-closed SELinux policy.
  • TFLite delegate and NNAPI or successor runtime integration.
  • StableHLO import through an MLIR pipeline.
  • IREE or TVM backend for repeatable lowering.
  • ExecuTorch/PyTorch export path for on-device model deployment.
  • Benchmark evidence with unsupported-op count, CPU fallback percentage, power traces, thermal traces, and exact model hashes.
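
As a rough illustration of the last bullet, the sketch below derives an unsupported-op count, a CPU fallback percentage, and a model hash from a list of per-node placements. Everything here is an assumption made for illustration: the `(op_name, device)` placement format, the function names, and the rule that any node not placed on `"npu"` counts as CPU fallback with its op type treated as unsupported. It is not the format used by the repo's benchmark tooling.

```python
import hashlib


def model_sha256(model_bytes: bytes) -> str:
    """Exact model hash, as a lowercase hex SHA-256 digest."""
    return hashlib.sha256(model_bytes).hexdigest()


def fallback_summary(placements: list[tuple[str, str]]) -> dict:
    """Summarize where graph nodes ran.

    placements holds one (op_name, device) pair per graph node.
    Assumption: a node whose device is not "npu" fell back to the CPU.
    """
    total = len(placements)
    cpu_ops = [op for op, device in placements if device != "npu"]
    return {
        "total_nodes": total,
        "delegated_nodes": total - len(cpu_ops),
        "unsupported_op_count": len(set(cpu_ops)),  # distinct op types
        "cpu_fallback_pct": 100.0 * len(cpu_ops) / total if total else 0.0,
    }


# 200 nodes, two of which (both "gather") ran on the CPU.
summary = fallback_summary([("conv2d", "npu")] * 198 + [("gather", "cpu")] * 2)
digest = model_sha256(b"")  # stand-in bytes; a real run hashes the model file
```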

Current Repo Gap

rtl/npu/e1_npu.sv is currently a scalar datapath plus a 64-byte scratchpad GEMM prototype. It now includes a packed signed INT4 dot-product opcode as the first low-precision primitive, but it is still missing the actual tensor NPU structure:

  • no tensor command queue,
  • no DMA-fed scratchpad,
  • no large SRAM,
  • no systolic array,
  • no sparse INT4 GEMM,
  • no INT2 or FP8 execution,
  • no compiler backend,
  • no Android accelerator delegate,
  • no area or power model,
  • no sustained hardware benchmark evidence.

The next implementation move is to replace the scalar GEMM prototype with a parameterized INT8/INT4 tile model and feed it through a descriptor-ring ABI.
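
A descriptor ring is a fixed-size circular queue that the host fills at a tail index and the device drains at a head index. The sketch below is a minimal software model of that idea, not the repo's ABI: the `Descriptor` fields, the class name, and the power-of-two depth rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical descriptor layout; the real ABI is not defined here."""
    opcode: int
    src_addr: int
    dst_addr: int
    length: int


class DescriptorRing:
    """Single-producer, single-consumer ring with head/tail indices."""

    def __init__(self, depth: int) -> None:
        if depth < 2 or depth & (depth - 1):
            raise ValueError("depth must be a power of two, at least 2")
        self.depth = depth
        self.slots: list[Optional[Descriptor]] = [None] * depth
        self.head = 0  # next slot the device consumes
        self.tail = 0  # next slot the host fills

    def free_slots(self) -> int:
        # One slot is kept empty so head == tail always means "empty".
        return self.depth - 1 - ((self.tail - self.head) % self.depth)

    def submit(self, desc: Descriptor) -> bool:
        """Host side: enqueue a descriptor, or return False if the ring is full."""
        if self.free_slots() == 0:
            return False
        self.slots[self.tail] = desc
        self.tail = (self.tail + 1) % self.depth
        return True

    def complete_one(self) -> Optional[Descriptor]:
        """Device side: retire the oldest descriptor, or None if empty."""
        if self.head == self.tail:
            return None
        desc = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % self.depth
        return desc
```

A depth-4 ring in this model holds at most three in-flight descriptors. Rejecting a submission when the ring is full, rather than overwriting, is the kind of behavior the evidence gate's "timeout/error behavior" item would need to document for the real hardware.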

Evidence Gate

The current repository must stay classified as L0_RTL_UNIT for NPU capability until a target report supplies all of the following:

| Evidence | Required content |
| --- | --- |
| TOPS/MAC counters | macs_per_inference, npu_cycles, npu_hz, observed_tops, and tops_formula derived from hardware counters |
| Precision | Actual delegate precision such as INT8, INT4, INT2, FP8, BF16, or FP16 |
| Dataflow | Named measured dataflow path, not only a GEMM math estimate |
| Descriptor queue | Queue depth, descriptor head/tail completion, timeout/error behavior, and host runtime submission proof |
| DMA | Hardware tensor-streaming DMA path and bytes read/written by the NPU workload |
| Runtime counters | Cycles, MACs, ops, errors, unsupported ops, DMA read bytes, and DMA written bytes from the measured path |
| Android HAL / NNAPI | AIDL HAL service proof, fail-closed SELinux policy, VTS/CTS results, e1-npu accelerator query, total/delegated node counts, zero CPU fallback, and zero unsupported ops |
| Model binding | Exact model SHA-256 and transcript hashes |
| Power/thermal | Calibrated power trace, thermal trace, frequency trace, calibration record, throttle state, and perf/W calculation with exact SHA-256 values |

The Android gate starts from docs/benchmarks/capabilities/e1_npu_android_proof_manifest.template.json. It stays blocked until an external AOSP validation job fills real HAL, VINTF, SELinux, VTS, CTS, NNAPI query, and absent-device fail-closed artifacts with matching hashes. The sustained efficiency gate starts from docs/benchmarks/capabilities/e1_npu_power_thermal_manifest.template.json and stays blocked until calibrated trace artifacts exist.

For review, TOPS is bounded by the counter evidence:

```text
observed_tops <= macs_per_inference * 2 / (npu_cycles / npu_hz) / 1e12
```
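
The bound above can be evaluated directly from the three counters. The sketch below is an illustration of that formula only; the function names are hypothetical, and it is not the logic of scripts/check_npu_2028_targets.py.

```python
def tops_upper_bound(macs_per_inference: int, npu_cycles: int, npu_hz: float) -> float:
    """Largest TOPS figure the counters can support for one inference."""
    seconds = npu_cycles / npu_hz          # wall time implied by the cycle count
    ops = macs_per_inference * 2           # one MAC counts as two operations
    return ops / seconds / 1e12


def observed_tops_is_bounded(observed_tops: float, macs_per_inference: int,
                             npu_cycles: int, npu_hz: float) -> bool:
    """True when a reported TOPS figure does not exceed the counter bound."""
    return observed_tops <= tops_upper_bound(macs_per_inference, npu_cycles, npu_hz)


# Made-up counters: 4e9 MACs in 1e6 cycles at 1 GHz is 8e9 ops in 1 ms,
# which bounds the reported figure at about 8 TOPS.
bound = tops_upper_bound(4_000_000_000, 1_000_000, 1e9)
```

A report claiming more TOPS than this bound is inconsistent with its own counters and should be rejected in review.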

The scalar RTL cannot produce phone-class TOPS, sustained power, or Android delegate evidence. Passing scripts/check_npu_2028_targets.py means the repo keeps this distinction explicit; it does not mean the 2028 target is met.

Next Commands

```sh
python3 scripts/check_npu_2028_targets.py
python3 scripts/check_platform_contract.py
python3 benchmarks/run_benchmarks.py plan --bench tflite_e1_npu --strict-missing
make npu-2028-target-check platform-contract-check
```