packages/chip/docs/toolchain/quantization-pipeline.md
The quantization pipeline produces calibration manifests consumed by the elizanpu IREE backend. Six formats target the NPU's hardware opcodes:
| Format | Hardware path | Default use |
|---|---|---|
| PTQ INT8 (per-channel weights / per-tensor activations) | GEMM_S8, DOT4_S8 | dense default for most CNN / small transformer |
| AWQ INT4 weight-only | DOT8_S4 | LLM weights (best PPL at 3-4 bit) |
| GPTQ INT4 weight-only | DOT8_S4 | fallback for non-LLM small-batch |
| FP8 E4M3 | DOT4_FP8_E4M3 (scalar contract today; tensor path BLOCKED) | long-context LLM where INT8/INT4 PPL degrades |
| 2:4 structured sparse INT4 | SDOT4_S4_2_4 | dense matmul layers with 50% magnitude pruning |
| INT2 BitNet | DOT16_S2 (scalar contract today; tensor path BLOCKED) | experimental ultra-low-precision LLM |
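As an illustration of the 2:4 structured sparsity row, here is a minimal Python sketch of magnitude pruning: in every group of four consecutive weights, the two smallest-magnitude values are zeroed, which gives the 50% pruning ratio. The function name `prune_2_4` and the tie-breaking rule are assumptions made for this example, not the pipeline's actual implementation.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries.
        # Ties keep the earlier index (an arbitrary choice for this sketch).
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(v if i in keep else 0 for i, v in enumerate(group))
    return pruned


weights = [3, -1, 0, 7, -2, 2, 5, -6]
print(prune_2_4(weights))  # [3, 0, 0, 7, 0, 0, 5, -6]
```

Each group of four ends up with exactly two non-zero slots, which is the pattern a 2:4 sparse opcode can exploit.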
Every calibrator emits a JSON manifest with a versioned schema string:

- `eliza.ptq_int8_manifest.v1`
- `eliza.awq_int4_manifest.v1`
- `eliza.gptq_int4_manifest.v1`
- `eliza.fp8_e4m3_manifest.v1`
- `eliza.sparse_2_4_int4_manifest.v1`
- `eliza.int2_bitnet_manifest.v1`

The IREE backend dispatches on the schema string at compile time.
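
Dispatching on the schema string can be pictured with the Python sketch below. It is only an illustration: the real dispatch happens inside the IREE backend, and the JSON key `"schema"`, the function `manifest_format`, and the short format labels are assumptions introduced here. Only the schema strings themselves come from this document.

```python
import json

# Schema strings listed in this document, mapped to illustrative labels.
KNOWN_SCHEMAS = {
    "eliza.ptq_int8_manifest.v1": "ptq_int8",
    "eliza.awq_int4_manifest.v1": "awq_int4",
    "eliza.gptq_int4_manifest.v1": "gptq_int4",
    "eliza.fp8_e4m3_manifest.v1": "fp8_e4m3",
    "eliza.sparse_2_4_int4_manifest.v1": "sparse_2_4_int4",
    "eliza.int2_bitnet_manifest.v1": "int2_bitnet",
}


def manifest_format(manifest_text):
    """Return the format label for a manifest, or raise on an unknown schema."""
    manifest = json.loads(manifest_text)
    schema = manifest.get("schema")  # key name is an assumption
    if schema not in KNOWN_SCHEMAS:
        raise ValueError(f"unsupported manifest schema: {schema!r}")
    return KNOWN_SCHEMAS[schema]


print(manifest_format('{"schema": "eliza.awq_int4_manifest.v1"}'))  # awq_int4
```

Rejecting unknown schema strings outright, rather than falling back to a default format, keeps a version bump such as a future `.v2` from being silently misread.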

1. Call the calibrator's `record_*` methods.
2. Call `build_manifest()` and write the JSON to disk.
3. Pass the manifest path to `iree-compile --iree-input-quantization-manifest=<path>`.

See `compiler/quantization/`.
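
The record, build, write, and compile sequence above can be sketched end to end in Python. Everything here other than `build_manifest()`, the `record_*` naming pattern, the schema string, and the `--iree-input-quantization-manifest` flag is hypothetical: `ToyInt8Calibrator`, `record_activation`, the manifest fields, and `model.mlir` are stand-ins chosen for the example, and the abs-max scale rule is one common choice rather than the pipeline's confirmed method.

```python
import json
import tempfile
from pathlib import Path


class ToyInt8Calibrator:
    """Hypothetical stand-in for a real calibrator class."""

    def __init__(self):
        self._abs_max = {}

    def record_activation(self, tensor_name, values):
        # Track the running absolute maximum seen for each tensor.
        peak = max(abs(v) for v in values)
        previous = self._abs_max.get(tensor_name, 0.0)
        self._abs_max[tensor_name] = max(previous, peak)

    def build_manifest(self):
        # Field names other than the schema string are illustrative.
        return {
            "schema": "eliza.ptq_int8_manifest.v1",
            "activations": {
                name: {"scale": peak / 127.0}  # symmetric per-tensor scale
                for name, peak in sorted(self._abs_max.items())
            },
        }


# Step 1: record calibration data.
calibrator = ToyInt8Calibrator()
calibrator.record_activation("conv1.out", [0.5, -2.54, 1.0])
calibrator.record_activation("conv1.out", [1.27, -0.1])

# Step 2: build the manifest and write the JSON to disk.
manifest_path = Path(tempfile.mkdtemp()) / "ptq_int8_manifest.json"
manifest_path.write_text(json.dumps(calibrator.build_manifest(), indent=2))

# Step 3: assemble (but do not run) the compile command.
compile_cmd = [
    "iree-compile",
    f"--iree-input-quantization-manifest={manifest_path}",
    "model.mlir",  # hypothetical input file
]
print(" ".join(compile_cmd))
```

The command is only assembled, not executed, so the sketch runs without the toolchain installed.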