The e1 NPU is a small synthesizable datapath behind a single-cycle MMIO
control interface. Software programs operands, selects an opcode, starts the
command, then polls CTRL_STATUS.done or waits for irq_npu.
This block is not a phone-class accelerator. It has only a local RTL descriptor
ring and DRAM-to-scratchpad read path, with no IOMMU, cache coherency, tensor
compiler backend, Android NNAPI delegate, production SRAM, or sustained
TOPS/power evidence. It may be cited only as L0 RTL/unit evidence unless a
higher-level report supplies the proof artifacts listed in
docs/benchmarks/capabilities/README.md.
1. Write OP_A.
2. Write OP_B.
3. Write ACC (optional; used by the MAC and DOT opcodes).
4. Write OPCODE.
5. Write CTRL_STATUS.start.
6. Poll CTRL_STATUS.done or wait for irq_npu.
7. Read RESULT.

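The sequence above can be sketched against a register-name interface. This is a minimal illustration, not driver code: `FakeNpu`, `run_command`, the name-keyed `read`/`write` accessors, and the `max_polls` limit are all hypothetical, and the stand-in models only ADD and completes immediately. Real software would use MMIO offsets, which this sketch deliberately does not assume.

```python
MASK32 = 0xFFFFFFFF


class FakeNpu:
    """Hypothetical in-memory stand-in for the MMIO block (ADD only)."""

    def __init__(self):
        self.regs = {"OP_A": 0, "OP_B": 0, "ACC": 0, "OPCODE": 0, "RESULT": 0}
        self.done = False

    def write(self, name, value):
        if name != "CTRL_STATUS.start":
            self.regs[name] = value & MASK32
        elif value & 1:
            if self.regs["OPCODE"] != 0:
                raise NotImplementedError("stand-in models ADD only")
            # Operands are taken at start; this stand-in finishes at once.
            self.regs["RESULT"] = (self.regs["OP_A"] + self.regs["OP_B"]) & MASK32
            self.done = True

    def read(self, name):
        if name == "CTRL_STATUS.done":
            return int(self.done)
        return self.regs[name]


def run_command(dev, opcode, op_a, op_b, acc=0, max_polls=1000):
    """Program operands, start the command, poll done, and return RESULT."""
    dev.write("OP_A", op_a)
    dev.write("OP_B", op_b)
    dev.write("ACC", acc)  # optional; only the MAC and DOT opcodes use it
    dev.write("OPCODE", opcode)
    dev.write("CTRL_STATUS.start", 1)
    for _ in range(max_polls):
        if dev.read("CTRL_STATUS.done"):
            return dev.read("RESULT")
    raise TimeoutError("NPU command did not complete")
```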
OPCODE is read/write; readback returns the programmed low 4 bits. RESULT_HI
contains the high word for MUL_LO and sign-extension for signed 32-bit
MAC_S16/DOT4_S8/DOT8_S4 results.
Implemented opcodes:

| Opcode | Name | Result |
|---|---|---|
| 0 | ADD | OP_A + OP_B |
| 1 | SUB | OP_A - OP_B |
| 2 | MUL_LO | low 32 bits of unsigned OP_A * OP_B; high word in RESULT_HI |
| 3 | MAC_S16 | signed low-16 multiply plus signed ACC |
| 4 | DOT4_S8 | four packed signed INT8 products plus signed ACC |
| 5 | MAX_U32 | unsigned max |
| 6 | MIN_U32 | unsigned min |
| 7 | DOT8_S4 | eight packed signed INT4 products plus signed ACC |
| 8 | GEMM_S8 | bounded scratchpad INT8 GEMM tile, signed int32 output |

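A software reference model can make the scalar opcode table easier to check. The sketch below is an illustration under assumptions, not the RTL's definition: it assumes RESULT wraps to 32 bits, that lane i of OP_A is multiplied with lane i of OP_B, and that ACC is interpreted as a signed 32-bit value. The helper names (`sext`, `lanes`, `dot`, `execute`) are hypothetical, and GEMM_S8 is not modeled here.

```python
MASK32 = 0xFFFFFFFF


def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def lanes(word, bits):
    """Split a 32-bit word into signed lanes, least-significant lane first."""
    return [sext(word >> shift, bits) for shift in range(0, 32, bits)]


def dot(op_a, op_b, acc, bits):
    """Sum of lane-wise signed products plus signed ACC, wrapped to 32 bits."""
    products = sum(a * b for a, b in zip(lanes(op_a, bits), lanes(op_b, bits)))
    return (products + sext(acc, 32)) & MASK32


def execute(opcode, op_a, op_b, acc=0):
    """Return the 32-bit RESULT for scalar opcodes 0..7."""
    op_a &= MASK32
    op_b &= MASK32
    if opcode == 0:  # ADD
        return (op_a + op_b) & MASK32
    if opcode == 1:  # SUB
        return (op_a - op_b) & MASK32
    if opcode == 2:  # MUL_LO: low word of the unsigned product
        return (op_a * op_b) & MASK32
    if opcode == 3:  # MAC_S16
        return (sext(op_a, 16) * sext(op_b, 16) + sext(acc, 32)) & MASK32
    if opcode == 4:  # DOT4_S8
        return dot(op_a, op_b, acc, 8)
    if opcode == 5:  # MAX_U32
        return max(op_a, op_b)
    if opcode == 6:  # MIN_U32
        return min(op_a, op_b)
    if opcode == 7:  # DOT8_S4
        return dot(op_a, op_b, acc, 4)
    raise ValueError("opcode not covered by this scalar model")
```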
Status bits:

| Bit | Name | Meaning |
|---|---|---|
| 0 | busy | Command is executing |
| 1 | done | Command completed; also drives irq_npu |
| 2 | error | Unsupported opcode was rejected |

Write CTRL_STATUS[1] to clear done and error. Operands are latched when
start is accepted; software should not rely on mid-command register writes
affecting the in-flight operation.
GEMM_S8 is a concrete tile prototype, not a tensor subsystem. Software stages
row-major signed INT8 A and B matrices into a 64-byte MMIO scratchpad and
programs a bounded command. The datapath performs one signed INT8 multiply
accumulate per cycle and writes row-major signed int32 C results back into the
scratchpad. The current RTL bounds are M <= 3, N <= 3, K <= 7, further
limited by the 64-byte scratchpad footprint.
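The tile computation described above can be sketched as a reference model over a 64-byte scratchpad image. This is an illustration under assumptions: it uses tightly packed row-major layouts, assumes the A, B, and C regions do not overlap, and treats zero dimensions as a no-op, none of which is confirmed by the RTL description. The names `gemm_s8` and `s8` are hypothetical.

```python
import struct

SCRATCH_BYTES = 64


def s8(byte):
    """Interpret an unsigned byte value as signed INT8."""
    return byte - 256 if byte >= 128 else byte


def gemm_s8(scratch, m, n, k, a_base, b_base, c_base):
    """Compute C = A x B in place in a 64-byte scratchpad image.

    A is m x k and B is k x n, both row-major signed INT8.
    C is m x n, row-major little-endian signed int32.
    """
    if not (0 <= m <= 3 and 0 <= n <= 3 and 0 <= k <= 7):
        raise ValueError("dimensions outside the documented RTL bounds")
    if c_base % 4:
        raise ValueError("C base must be word-aligned")
    if (a_base + m * k > SCRATCH_BYTES
            or b_base + k * n > SCRATCH_BYTES
            or c_base + 4 * m * n > SCRATCH_BYTES):
        raise ValueError("tile does not fit in the scratchpad")
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):  # one multiply-accumulate per step
                a = s8(scratch[a_base + i * k + p])
                b = s8(scratch[b_base + p * n + j])
                acc += a * b
            struct.pack_into("<i", scratch, c_base + 4 * (i * n + j), acc)
```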
Additional registers:

| Offset | Name | Fields |
|---|---|---|
| 0x20 | GEMM_CFG | M[1:0], N[9:8], K[18:16] |
| 0x24 | GEMM_BASE | byte bases: A[5:0], B[13:8], C[21:16] |
| 0x28 | GEMM_STRIDE | byte strides: A[3:0], B[11:8], C[19:16] |
| 0x2c | PERF_UNSUPPORTED_OPS | unsupported opcode/configuration counter |
| 0x30 | CMD_PARAM | bit 0 selects descriptor-submission mode |
| 0x40 | DESC_BASE | descriptor ring base; must be 32-bit aligned |
| 0x44 | DESC_HEAD | software producer index, 3 bits |
| 0x48 | DESC_TAIL | hardware/software consumer index, 3 bits |
| 0x4c | DESC_STATUS | descriptor status bits plus error index in bits [11:9] |
| 0x50 | PERF_CYCLES | cycles spent in active state |
| 0x54 | PERF_MACS | signed INT8 MAC operations issued |
| 0x58 | PERF_OPS | accepted operation counter |
| 0x5c | PERF_ERRORS | rejected commands/configurations; write bit 0 to clear all perf counters |
| 0x60 | DESC_TIMEOUT_COUNT | cycles spent in the active descriptor engine |
| 0x64 | DESC_BYTES_READ | descriptor plus tensor-stream bytes accepted by the NPU read path |
| 0x68 | DESC_BYTES_WRITTEN | descriptor writeback bytes accepted by the NPU write path; always zero until writeback exists |
| 0x6c | DESC_READ_BEATS | descriptor plus tensor-stream read beats accepted |
| 0x70 | DESC_WRITE_BEATS | descriptor writeback beats accepted; always zero until writeback exists |
| 0x80-0xbc | SCRATCH[0..15] | 16 little-endian 32-bit scratchpad words |

For row-major A[M][K], B[K][N], and C[M][N], use A_STRIDE = K,
B_STRIDE = N, and C_STRIDE = 4*N. C_BASE must be word-aligned. Invalid
dimensions or scratchpad addresses complete with CTRL_STATUS.done|error set
and increment PERF_ERRORS.
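The field layouts and row-major stride rule above can be turned into register values with a small packing helper. This is a sketch: the function names `field` and `gemm_registers` are hypothetical, and the returned dictionary keyed by register offset is just one convenient representation.

```python
def field(value, width, shift, name):
    """Place value in a bit field, rejecting values that do not fit."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name}={value} does not fit in {width} bits")
    return value << shift


def gemm_registers(m, n, k, a_base, b_base, c_base):
    """Return {offset: value} for GEMM_CFG, GEMM_BASE, and GEMM_STRIDE.

    Strides follow the row-major rule: A_STRIDE = K, B_STRIDE = N,
    C_STRIDE = 4*N.
    """
    if c_base % 4:
        raise ValueError("C_BASE must be word-aligned")
    cfg = field(m, 2, 0, "M") | field(n, 2, 8, "N") | field(k, 3, 16, "K")
    base = (field(a_base, 6, 0, "A_BASE")
            | field(b_base, 6, 8, "B_BASE")
            | field(c_base, 6, 16, "C_BASE"))
    stride = (field(k, 4, 0, "A_STRIDE")
              | field(n, 4, 8, "B_STRIDE")
              | field(4 * n, 4, 16, "C_STRIDE"))
    return {0x20: cfg, 0x24: base, 0x28: stride}
```

For a 2x2x2 tile with A at byte 0, B at byte 8, and C at byte 16, this yields GEMM_CFG = 0x20202, GEMM_BASE = 0x100800, and GEMM_STRIDE = 0x80202.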
The full v0.1 NPU ABI should extend this pattern:
- MMIO control registers
- command queue
- DMA descriptors
- scratchpad allocation
- INT8/INT4 GEMM commands
- completion interrupt
- performance counters

Current integration is still a prototype datapath model. When CMD_PARAM[0] is
set and software writes CTRL_STATUS.start, the RTL validates base alignment
and empty/non-empty queue state, then fetches four 32-bit descriptor words from
the read-only m_axil_ar/r descriptor port for each visible queue entry.
Descriptor word 0 carries opcode[3:0], stream_to_scratch[8],
scratch_offset[21:16], byte_count[29:24], writeback_request[30], and
valid_owner[31]. Software must set valid_owner before advancing DESC_HEAD;
the current RTL rejects descriptors without this bit and leaves DESC_TAIL
unchanged. Word 1 is the stream source byte address when streaming is enabled,
or scalar OP_A otherwise. Words 2 and 3 are scalar OP_B and ACC, or
reserved for streamed GEMM. The stream path is aligned 32-bit reads only and
writes into the 64-byte scratchpad before launching the selected existing opcode.
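The descriptor layout above can be sketched as a packing helper. The function names `pack_desc_word0` and `build_scalar_descriptor` are hypothetical, and the sketch checks only field widths; it does not model alignment rules or any RTL-side validation.

```python
MASK32 = 0xFFFFFFFF


def pack_desc_word0(opcode, stream_to_scratch=False, scratch_offset=0,
                    byte_count=0, writeback_request=False, valid_owner=True):
    """Pack descriptor word 0 from its documented bit fields."""
    if not 0 <= opcode < 16:
        raise ValueError("opcode must fit in bits [3:0]")
    if not 0 <= scratch_offset < 64:
        raise ValueError("scratch_offset must fit in bits [21:16]")
    if not 0 <= byte_count < 64:
        raise ValueError("byte_count must fit in bits [29:24]")
    return (opcode
            | (int(stream_to_scratch) << 8)
            | (scratch_offset << 16)
            | (byte_count << 24)
            | (int(writeback_request) << 30)
            | (int(valid_owner) << 31))


def build_scalar_descriptor(opcode, op_a, op_b, acc=0):
    """Return the four 32-bit words of a non-streaming descriptor.

    Word 1 is OP_A, word 2 is OP_B, and word 3 is ACC.
    """
    return [pack_desc_word0(opcode), op_a & MASK32, op_b & MASK32, acc & MASK32]
```

Software would write these four words at the ring slot and only then advance DESC_HEAD, since the RTL rejects descriptors whose valid_owner bit is clear.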
DESC_STATUS bits:

| Bits | Meaning |
|---|---|
| 0 | Queue empty |
| 1 | Descriptor completion |
| 2 | Descriptor error |
| 3 | Autonomous timeout |
| 4 | Descriptor fetch read error |
| 5 | Tensor stream read/configuration error |
| 6 | Descriptor missing the valid owner bit |
| 7 | Unsupported writeback request |
| 8 | Descriptor engine busy |
| 11:9 | Index of the descriptor that faulted or completed |

The three visible head/tail bits do not encode a full-ring condition. A missing
descriptor or stream read response times out with CTRL_STATUS.done|error;
read-response errors fail closed. The standalone DMA block tracks aligned 32-bit
beat issue, byte completion, last source/destination addresses, and final write
strobe, but NPU descriptor streaming uses the NPU read master and still has no
writeback DMA path. Descriptors with writeback_request set are rejected before
launch, and DESC_BYTES_WRITTEN/DESC_WRITE_BEATS remain zero.
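The DESC_STATUS bit assignments above can be decoded with a small helper when reading logs or test output. The flag names and the `decode_desc_status` function are hypothetical labels chosen for this sketch; only the bit positions come from the description.

```python
DESC_STATUS_FLAGS = {
    0: "empty",
    1: "complete",
    2: "error",
    3: "timeout",
    4: "fetch_read_error",
    5: "stream_error",
    6: "missing_owner",
    7: "unsupported_writeback",
    8: "busy",
}


def decode_desc_status(value):
    """Split a DESC_STATUS value into set flag names and the index field."""
    flags = [name for bit, name in DESC_STATUS_FLAGS.items()
             if (value >> bit) & 1]
    index = (value >> 9) & 0x7  # bits [11:9]
    return {"flags": flags, "index": index}
```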
Before any e1-npu benchmark is treated as accelerator evidence, the report
must include:
e1-npu,TOPS is a derived review field, not proof by itself:
observed_tops <= macs_per_inference * 2 / (npu_cycles / npu_hz) / 1e12
The current RTL cannot satisfy those gates because its measured GEMM output path is still the 64-byte scratchpad and descriptor stream reads have no writeback DMA, cache coherency, production queue ownership, or software-owned completion queue.
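The derived TOPS bound above can be evaluated with a few lines of arithmetic. This sketch only restates the formula; the function name `tops_upper_bound` is hypothetical, and any inputs passed to it must come from a measured report rather than from this document.

```python
def tops_upper_bound(macs_per_inference, npu_cycles, npu_hz):
    """Upper bound on observed TOPS, counting each MAC as two operations."""
    if npu_cycles <= 0 or npu_hz <= 0:
        raise ValueError("npu_cycles and npu_hz must be positive")
    seconds = npu_cycles / npu_hz
    return macs_per_inference * 2 / seconds / 1e12
```

As a purely illustrative calculation with a made-up clock, a datapath that sustains one MAC per cycle at 100 MHz is bounded at 0.0002 TOPS, which is why the derived field is a review check and not accelerator evidence.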