Back to Eliza

NPU command ABI

packages/chip/docs/arch/npu.md

2.0.17.6 KB
Original Source

NPU command ABI

The e1 NPU is a small synthesizable datapath behind a single-cycle MMIO control interface. Software programs operands, selects an opcode, starts the command, then polls CTRL_STATUS.done or waits for irq_npu.

This block is not a phone-class accelerator. It has only a local RTL descriptor ring and DRAM-to-scratchpad read path, with no IOMMU, cache coherency, tensor compiler backend, Android NNAPI delegate, production SRAM, or sustained TOPS/power evidence. It may be cited only as L0 RTL/unit evidence unless a higher-level report supplies the proof artifacts listed in docs/benchmarks/capabilities/README.md.

text
write OP_A
write OP_B
write ACC              ; optional, used by MAC/DOT4
write OPCODE
write CTRL_STATUS.start
poll or wait for irq_npu
read RESULT

OPCODE is read/write; readback returns the programmed low 4 bits. RESULT_HI contains the high word for MUL_LO and sign-extension for signed 32-bit MAC_S16/DOT4_S8/DOT8_S4 results. MAC_S16/DOT4_S8 results.

Implemented opcodes:

OpcodeNameResult
0ADDOP_A + OP_B
1SUBOP_A - OP_B
2MUL_LOlow 32 bits of unsigned OP_A * OP_B; high word in RESULT_HI
3MAC_S16signed low-16 multiply plus signed ACC
4DOT4_S8four packed signed INT8 products plus signed ACC
5MAX_U32unsigned max
6MIN_U32unsigned min
7DOT8_S4eight packed signed INT4 products plus signed ACC
8GEMM_S8bounded scratchpad INT8 GEMM tile, signed int32 output

Status bits:

BitNameMeaning
0busyCommand is executing
1doneCommand completed; also drives irq_npu
2errorUnsupported opcode was rejected

Write CTRL_STATUS[1] to clear done and error. Operands are latched when start is accepted; software should not rely on mid-command register writes affecting the in-flight operation.

Scratchpad GEMM prototype

GEMM_S8 is a concrete tile prototype, not a tensor subsystem. Software stages row-major signed INT8 A and B matrices into a 64-byte MMIO scratchpad and programs a bounded command. The datapath performs one signed INT8 multiply accumulate per cycle and writes row-major signed int32 C results back into the scratchpad. The current RTL bounds are M <= 3, N <= 3, K <= 7, further limited by the 64-byte scratchpad footprint.

Additional registers:

OffsetNameFields
0x20GEMM_CFGM[1:0], N[9:8], K[18:16]
0x24GEMM_BASEbyte bases: A[5:0], B[13:8], C[21:16]
0x28GEMM_STRIDEbyte strides: A[3:0], B[11:8], C[19:16]
0x2cPERF_UNSUPPORTED_OPSunsupported opcode/configuration counter
0x30CMD_PARAMbit 0 selects descriptor-submission mode
0x40DESC_BASEdescriptor ring base; must be 32-bit aligned
0x44DESC_HEADsoftware producer index, 3 bits
0x48DESC_TAILhardware/software consumer index, 3 bits
0x4cDESC_STATUSdescriptor status bits plus error index in bits [11:9]
0x50PERF_CYCLEScycles spent in active state
0x54PERF_MACSsigned INT8 MAC operations issued
0x58PERF_OPSaccepted operation counter
0x5cPERF_ERRORSrejected commands/configurations; write bit 0 to clear all perf counters
0x60DESC_TIMEOUT_COUNTcycles spent in the active descriptor engine
0x64DESC_BYTES_READdescriptor plus tensor-stream bytes accepted by the NPU read path
0x68DESC_BYTES_WRITTENdescriptor writeback bytes accepted by the NPU write path; always zero until writeback exists
0x6cDESC_READ_BEATSdescriptor plus tensor-stream read beats accepted
0x70DESC_WRITE_BEATSdescriptor writeback beats accepted; always zero until writeback exists
0x80-0xbcSCRATCH[0..15]16 little-endian 32-bit scratchpad words

For row-major A[M][K], B[K][N], and C[M][N], use A_STRIDE = K, B_STRIDE = N, and C_STRIDE = 4*N. C_BASE must be word-aligned. Invalid dimensions or scratchpad addresses complete with CTRL_STATUS.done|error set and increment PERF_ERRORS.

The full v0.1 NPU ABI should extend this pattern:

text
MMIO control registers
command queue
DMA descriptors
scratchpad allocation
INT8/INT4 GEMM commands
completion interrupt
performance counters

Current integration is still a prototype datapath model. When CMD_PARAM[0] is set and software writes CTRL_STATUS.start, the RTL validates base alignment and empty/non-empty queue state, then fetches four 32-bit descriptor words from the read-only m_axil_ar/r descriptor port for each visible queue entry. Descriptor word 0 carries opcode[3:0], stream_to_scratch[8], scratch_offset[21:16], byte_count[29:24], writeback_request[30], and valid_owner[31]. Software must set valid_owner before advancing DESC_HEAD; the current RTL rejects descriptors without this bit and leaves DESC_TAIL unchanged. Word 1 is the stream source byte address when streaming is enabled, or scalar OP_A otherwise. Words 2 and 3 are scalar OP_B and ACC, or reserved for streamed GEMM. The stream path is aligned 32-bit reads only and writes into the 64-byte scratchpad before launching the selected existing opcode.

DESC_STATUS[0] reports empty, [1] reports descriptor completion, [2] reports descriptor error, [3] reports autonomous timeout, [4] reports descriptor fetch read error, [5] reports tensor stream read/configuration error, [6] reports a descriptor missing the valid owner bit, [7] reports an unsupported writeback request, [8] reports descriptor engine busy, and [11:9] reports the descriptor index that faulted or completed. The three visible head/tail bits do not encode a full-ring condition. A missing descriptor or stream read response times out with CTRL_STATUS.done|error; read-response errors fail closed. The standalone DMA block tracks aligned 32-bit beat issue, byte completion, last source/destination addresses, and final write strobe, but NPU descriptor streaming uses the NPU read master and still has no writeback DMA path. Descriptors with writeback_request set are rejected before launch, and DESC_BYTES_WRITTEN/DESC_WRITE_BEATS remain zero.

Evidence gates

Before any e1-npu benchmark is treated as accelerator evidence, the report must include:

  • exact model SHA-256 and Android/Linux target identity,
  • NNAPI accelerator query showing e1-npu,
  • total/delegated NNAPI node count, zero CPU fallback, and zero unsupported ops,
  • precision actually used by the delegate,
  • dataflow name and description from the measured path,
  • DMA path plus bytes read and written by the NPU workload; current local RTL reports descriptor read counts and zero write counts because writeback requests fail closed,
  • descriptor queue depth, head/tail completion evidence, and timeout/error behavior for queued commands,
  • MACs per inference, NPU cycles, NPU clock, DMA byte counters, operation/error counters, observed TOPS, and the TOPS formula,
  • Android HAL service, SELinux fail-closed policy, VTS result, and CTS result when any Android accelerator claim is made,
  • transcript hashes for adb, NNAPI query, benchmark output, and DMA trace.

TOPS is a derived review field, not proof by itself:

text
observed_tops <= macs_per_inference * 2 / (npu_cycles / npu_hz) / 1e12

The current RTL cannot satisfy those gates because its measured GEMM output path is still the 64-byte scratchpad and descriptor stream reads have no writeback DMA, cache coherency, production queue ownership, or software-owned completion queue.