Benchmark, Toolchain, and Simulator Critical Gap Audit

packages/chip/docs/toolchain/benchmark-simulator-critical-gap-audit.md

Date: 2026-05-17

Scope: benchmarks/**, sim/**, scripts/check_tools.sh, scripts/run_qemu.sh, scripts/run_renode.sh, Dockerfile, flake.nix, and docs/toolchain/**.

Status Terms

| Status | Meaning | Strict gate behavior |
| --- | --- | --- |
| PASS | The required source, tool, artifact, and transcript exist for the named gate. | Exit 0. |
| BLOCK / BLOCKED | The repo scaffold is coherent, but external tools, generated assets, or run evidence are absent. | Non-strict status checks may exit 0; strict checks must exit 2 or fail the caller. |
| FAIL | A checked-in file, semantic contract, schema, build, or executable run is wrong. | Exit non-zero in all modes. |
| planned_missing_deps | Benchmark dry-run command is valid, but one or more executable dependencies are absent. | --strict-missing exits 2. |
| blocked | Benchmark dry-run or run is blocked by release-visible model/data assets. | --strict-missing exits 2. |
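
The exit policy in the table above can be sketched as a small Python helper. This is a minimal illustration, not the repo's implementation: the `Status` enum and `gate_exit_code` function are hypothetical names, and the choice of exit code 1 for FAIL is an assumption (the table only requires non-zero).

```python
from enum import Enum


class Status(str, Enum):
    """Gate-level status terms from the table (hypothetical enum)."""
    PASS = "PASS"
    BLOCKED = "BLOCKED"
    FAIL = "FAIL"


def gate_exit_code(statuses, strict=False):
    """Collapse per-stage statuses into a single process exit code.

    FAIL always wins and is non-zero in every mode (1 is an assumed value).
    BLOCKED exits 2 only in strict mode; otherwise it is tolerated.
    """
    statuses = list(statuses)
    if Status.FAIL in statuses:
        return 1
    if Status.BLOCKED in statuses:
        return 2 if strict else 0
    return 0
```

A caller would pass `strict=True` for release gates, so a BLOCKED stage can no longer hide behind a zero exit code.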

Machine-readable status sources now include:

  • scripts/check_tools.sh --json, schema eliza.tool_status.v1.
  • benchmarks/run_benchmarks.py plan|run, schema eliza.benchmark_run.v1.
  • scripts/run_qemu.sh --check, STATUS: PASS|BLOCKED|FAIL qemu.* stage lines.
  • scripts/run_renode.sh --check, STATUS: PASS|BLOCKED|FAIL renode.* stage lines.
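
A consumer of the `STATUS:` stage lines could be sketched as follows. The exact line layout is an assumption: this sketch expects `STATUS: <PASS|BLOCKED|FAIL> <stage> [detail]`, and both `parse_status_lines` and the sample stage names are hypothetical, so the pattern should be checked against real script output before use.

```python
import re

# Assumed layout: "STATUS: PASS qemu.build optional free-form detail"
STATUS_LINE = re.compile(r"^STATUS:\s+(PASS|BLOCKED|FAIL)\s+(\S+)(?:\s+(.*))?$")


def parse_status_lines(text):
    """Return one dict per STATUS line; all other output lines are ignored."""
    results = []
    for line in text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            results.append({
                "status": match.group(1),
                "stage": match.group(2),
                "detail": match.group(3) or "",
            })
    return results


# Illustrative input only; stage names are made up for the example.
sample = (
    "building firmware...\n"
    "STATUS: PASS qemu.toolchain\n"
    "STATUS: BLOCKED qemu.run qemu-system-riscv64 not found\n"
)
stages = parse_status_lines(sample)
```

Ignoring non-matching lines keeps the parser usable on full logs, at the cost of silently skipping malformed status lines; a stricter variant could reject any line that starts with `STATUS:` but fails the pattern.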

Missing Benchmark Tools and Assets

| Benchmark | Missing tools/assets | Current machine status | Required unblock |
| --- | --- | --- | --- |
| CoreMark | coremark executable; target compiler flags; target clock and affinity metadata. | planned_missing_deps when coremark is absent. | Add a pinned CoreMark build recipe, place executable on target PATH, and record compiler/version/clock evidence. |
| STREAM | stream_c.exe; array size policy; compiler flags; thread/affinity policy; memory clock evidence. | planned_missing_deps when stream_c.exe is absent. | Add fixed build flags and run metadata before using scores. |
| lmbench bandwidth | bw_mem executable. | planned_missing_deps when bw_mem is absent. | Build lmbench for the target and archive raw stdout plus parsed metric. |
| lmbench latency | lat_mem_rd executable. | planned_missing_deps when lat_mem_rd is absent. | Build lmbench for the target and archive stride sweep output. |
| fio sequential read | fio; target filesystem/device identity; JSON parser is not required by config yet. | planned_missing_deps when fio is absent. | Install target fio, switch configs to JSON output, and record target storage topology. |
| fio random read/write | fio; target filesystem/device identity; JSON parser is not required by config yet. | planned_missing_deps when fio is absent. | Same as sequential read; include random workload parameters in report metadata. |
| TFLite CPU | benchmark_model; benchmarks/models/mobile_smoke.tflite; pinned model SHA-256. | blocked while the model is absent or placeholder-sized; may also report missing benchmark_model. | Generate or supply redistributable .tflite, pin SHA-256, and archive benchmark_model build provenance. |
| TFLite e1 NPU | benchmark_model with NNAPI; mobile_smoke.tflite; real e1-npu NNAPI delegate/accelerator. | blocked while model is absent; planned_missing_deps for binary in dry-run. | Add model, NNAPI delegate evidence, accelerator name validation, and parser for latency output. |
| MLPerf Mobile | External checkout, APK/runner, datasets, Android target, device shell path. | Documentation only; not represented in benchmark_plan.json. | Add an external-run manifest before accepting MLPerf numbers. |

The benchmark harness correctly refuses to mark a result passed if required executables are missing or model artifacts are unavailable. The remaining gap is metric quality: the configs plan commands and dependency checks, but do not yet pin build recipes, parsers, target thermal/power context, or sustained-run metadata.
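
The refusal logic described above might look roughly like the sketch below. It is not taken from `benchmarks/run_benchmarks.py`: `classify_workload` and the `min_asset_bytes` threshold (standing in for the "placeholder-sized" check) are hypothetical, and the ordering that lets a missing asset take precedence over a missing executable is an assumption.

```python
import shutil
from pathlib import Path


def classify_workload(executables, assets=(), min_asset_bytes=1):
    """Classify one benchmark entry for a dry-run report.

    Returns "blocked" when a model/data asset is absent or too small,
    "planned_missing_deps" when an executable is not on PATH,
    and "planned" when everything needed is present.
    """
    missing_assets = [
        str(path)
        for path in map(Path, assets)
        if not path.is_file() or path.stat().st_size < min_asset_bytes
    ]
    if missing_assets:
        return {"status": "blocked", "missing": missing_assets}

    missing_tools = [exe for exe in executables if shutil.which(exe) is None]
    if missing_tools:
        return {"status": "planned_missing_deps", "missing": missing_tools}

    return {"status": "planned", "missing": []}
```

For example, `classify_workload(["fio"])` would report `planned_missing_deps` on a host without fio installed, which is the state a `--strict-missing` run is meant to turn into exit code 2.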

Fake and Fallback Simulator Paths

| Area | Current behavior | Risk | Required unblock |
| --- | --- | --- | --- |
| QEMU target | scripts/run_qemu.sh builds and runs a qemu-virt RISC-V firmware, not e1-chip hardware. | A qemu-virt serial banner can be mistaken for e1-chip boot evidence. | Keep docs and status lines saying software reference only; archive the ELF and transcript only as software-reference evidence. |
| QEMU non-strict check | Missing RISC-V compiler or qemu-system-riscv64 reports BLOCKED and exits 0 unless REQUIRE_QEMU=1. | CI smoke can stay green while executable QEMU evidence is absent. | Use make qemu-check-strict for release gates. |
| QEMU fake test path | scripts/test_qemu_smoke_status.py injects fake compiler/QEMU binaries to test status handling. | Fake PASS could be misread as simulator evidence if logs are reused. | Treat it only as unit coverage for status transitions; never archive it as boot proof. |
| Renode scaffold | sim/renode/eliza_e1.repl and .resc model the qemu-virt memory/UART map. | It is not a real e1 hardware model and the interactive .resc is not boot evidence by itself. | Add a bounded transcript capture and hardware-map model before using Renode as boot evidence. |
| Renode non-strict check | Missing renode, missing firmware, or missing real transcript intake reports BLOCKED and exits 0 unless REQUIRE_RENODE=1; scripts/run_renode.sh --check --transcript PATH only passes after archiving a transcript containing the expected banner. | Smoke can pass only with semantic scaffold coverage plus an explicitly supplied transcript; the transcript still must come from a real Renode run. | Use make renode-check-strict for release gates and archive build/reports/renode_smoke.manifest with any real transcript evidence. |
| Verilator/cocotb fallback | RTL tests are real fast-path checks, but they do not validate QEMU/Renode software boot. | Passing RTL smoke can be overclaimed as system software readiness. | Keep software/simulator claims tied to their own status artifacts. |
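
Transcript intake, as described for the Renode check, could be sketched like this. The function name `intake_transcript`, the manifest fields, and the banner string passed by the caller are all hypothetical; the real `scripts/run_renode.sh` and `renode_smoke.manifest` formats are not shown in this audit. Note that a banner match cannot prove the transcript came from a real Renode run, so provenance still has to be recorded separately.

```python
import hashlib
from pathlib import Path


def intake_transcript(transcript_path, expected_banner, source_tool):
    """Check a captured transcript and build a manifest entry for it.

    BLOCKED: no transcript was supplied or the file does not exist.
    FAIL:    the transcript exists but lacks the expected banner.
    PASS:    the banner is present; size and SHA-256 are recorded.
    """
    path = Path(transcript_path) if transcript_path else None
    if path is None or not path.is_file():
        return {"status": "BLOCKED", "reason": "transcript not supplied"}

    data = path.read_bytes()
    text = data.decode("utf-8", errors="replace")
    if expected_banner not in text:
        return {"status": "FAIL", "reason": "expected banner not found"}

    return {
        "status": "PASS",
        "source_tool": source_tool,  # e.g. "renode"; supplied by the caller
        "transcript": str(path),
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Recording the hash alongside the tool name makes it harder to reuse a log from the fake-binary status tests as boot evidence without that substitution being visible in the manifest.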

Strict vs Non-Strict Gates

| Gate | Non-strict behavior | Strict behavior |
| --- | --- | --- |
| scripts/check_tools.sh | Prints PASS, BLOCK, and FAIL; exits 0 unless required fast-path tools/packages are missing and --strict is set. | scripts/check_tools.sh --strict exits 1 on missing required fast-path tools or Python packages. |
| scripts/check_tools.sh --json | Emits eliza.tool_status.v1 with per-tool status, tier, gate, required, and path_or_status. | Combine with --strict to preserve the same exit policy. |
| benchmarks/run_benchmarks.py plan / --dry-run | Writes a dry-run report with planned, planned_missing_deps, or blocked. | --strict-missing exits 2 when any dependency or release-blocking asset is absent. |
| benchmarks/run_benchmarks.py run | Skips blocked/missing workloads by recording blocked or missing_dependencies; real command failures exit 1. | --strict-missing exits 2 for missing deps/assets and 1 for real failures. |
| make qemu-check | Semantic failures fail; missing compiler/QEMU is BLOCKED and exits 0. | make qemu-check-strict sets REQUIRE_QEMU=1 and exits 2 on blocked executable smoke. |
| make renode-check | Semantic failures fail; missing Renode/firmware/real transcript intake is BLOCKED and exits 0. | make renode-check-strict sets REQUIRE_RENODE=1 and exits 2 on blocked executable smoke. |
| make smoke | Includes non-strict QEMU/Renode and benchmark dry-run checks. | Not a release evidence gate. |

Missing Reproducibility Dependencies

| Component | Gap | Risk | Required unblock |
| --- | --- | --- | --- |
| Docker base | ubuntu:24.04 is a moving tag and apt package versions are not frozen. | Rebuilding later can change Verilator/Yosys/QEMU/compiler behavior. | Pin base image digest and archive apt package manifest. |
| Docker benchmark stack | Fast image omits fio, lmbench, CoreMark, STREAM, TFLite benchmark_model, Renode, OpenLane, Magic, Netgen, KiCad CLI, OpenOCD, and sigrok. | Tool inventory may show benchmark and heavy flow blocks on a clean Docker path. | Add a separate benchmark/heavy image or document target-rootfs installation manifests. |
| Python requirements | requirements.txt is bounded but not hash-locked. | Local and container Python packages can drift. | Add lock/constraints with hashes for accepted evidence paths. |
| Nix | nixos-unstable floats and flake.lock is absent. | nix develop is not reproducible. | Run and commit nix flake lock once Nix is a supported gate. |
| OpenLane2 bootstrap | Script clones default branch under external/openlane2. | PD evidence can change without repo changes. | Pin tag/SHA and recursive dependency manifest. |
| Chipyard bootstrap | Script clones default branch under external/chipyard. | Generator evidence can drift. | Pin release/SHA plus submodule manifest. |
| OSS CAD Suite | Local path discovery only; no archive URL/checksum. | Host fallback versions are not replayable. | Pin release URL and checksum if used as canonical host toolchain. |
| QEMU firmware toolchain | riscv64-unknown-elf-gcc, riscv64-elf-gcc, or riscv64-linux-gnu-gcc is accepted. | Different compilers can produce different ELF behavior. | Record compiler path/version and archive built ELF hash. |
| Renode | Local renode from PATH; no version pin. | Transcript behavior can vary across installs. | Record version and use a pinned install path for release. |
| Benchmark models | mobile_smoke.tflite is absent and no SHA is pinned. | TFLite runs cannot be reproduced or compared. | Generate/commit a redistributable model or store it as an explicit release asset with SHA-256. |
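
Several rows above ask for a pinned SHA-256 (the benchmark model, the built ELF). A minimal sketch of such a pin check follows; `verify_pinned_asset` and its three return values are hypothetical and not part of the current harness, and the pinned digest would have to come from a checked-in manifest that does not exist yet.

```python
import hashlib
from pathlib import Path


def verify_pinned_asset(path, expected_sha256, chunk_size=1 << 20):
    """Compare a file's SHA-256 against a pinned digest.

    Returns "blocked" if the file is absent, "fail" on a digest mismatch,
    and "ok" when the digest matches. The file is read in chunks so large
    model files do not need to fit in memory.
    """
    asset = Path(path)
    if not asset.is_file():
        return "blocked"

    digest = hashlib.sha256()
    with asset.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)

    return "ok" if digest.hexdigest() == expected_sha256.lower() else "fail"
```

Keeping "blocked" distinct from "fail" mirrors the status terms used elsewhere in this audit: an absent asset is a missing dependency, while a wrong digest means a checked-in contract is violated.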

GUI and Non-CLI Risks

| Tool/flow | CLI status | Risk | Mitigation |
| --- | --- | --- | --- |
| KiCad | kicad-cli is discoverable, but no project exists. | Manual schematic/PCB GUI work could be claimed without exported artifacts. | Require kicad-cli ERC/DRC/plot/export once a .kicad_pro is checked in. |
| GTKWave | GUI-oriented optional debug tool. | Waveform review is not reproducible as evidence. | Treat gtkwave as debug only; archive simulator logs and VCD/FST files instead. |
| Docker Desktop on macOS | CLI can drive builds, daemon is host-managed. | GUI daemon state or missing engine can block headless runs. | Record Docker CLI/daemon versions and image digest. |
| AOSP/Cuttlefish | Mostly CLI, but host KVM/device services are external. | AOSP boot proof can depend on host setup not captured in repo. | Add transcript parsers for lunch, build, and first boot once checkout exists. |
| OpenLane/OpenROAD/KLayout/Magic | Headless-capable, but often inspected through GUI locally. | Visual signoff can bypass repo artifacts. | Require report/GDS/DEF/DRC/LVS manifests from CLI runs only. |
| OpenOCD/sigrok | CLI-capable but no board profile exists. | Manual probe sessions are not replayable. | Add board config and capture scripts before using lab evidence. |
| FreeCAD/mechanical | FreeCADCmd exists but no model is checked in. | GUI-only mechanical edits are invisible to CI. | Require command-line export/check scripts once mechanical models exist. |

Required Follow-Up Checks

  1. Add benchmark JSON parsers for fio, CoreMark, STREAM, lmbench, and TFLite output.
  2. Add build recipes and version capture for CoreMark, STREAM, lmbench, fio, and benchmark_model.
  3. Pin Docker/Nix/OpenLane/Chipyard inputs before using their outputs as release evidence.
  4. Archive QEMU/Renode transcripts under build/reports/ only when they come from real tools, not fake status tests.
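
As a starting point for the first follow-up item, a fio JSON parser could be sketched as below. It assumes fio is run with `--output-format=json` and that the report carries a top-level `jobs` list whose entries have `read`/`write` objects with `io_bytes`, `bw_bytes`, and `iops` fields, as recent fio 3.x releases emit; these field names should be verified against the target fio version. `parse_fio_json` is a hypothetical name, and the sample report is synthetic, not a measurement.

```python
import json


def parse_fio_json(text):
    """Extract per-job bandwidth and IOPS from fio JSON output."""
    report = json.loads(text)
    rows = []
    for job in report.get("jobs", []):
        for direction in ("read", "write"):
            stats = job.get(direction, {})
            # Skip directions that transferred no data in this job.
            if stats.get("io_bytes", 0) == 0:
                continue
            rows.append({
                "job": job.get("jobname", ""),
                "direction": direction,
                "bw_bytes_per_s": stats.get("bw_bytes"),
                "iops": stats.get("iops"),
            })
    return rows


# Synthetic report for illustration only; the numbers are not real results.
sample_report = json.dumps({
    "jobs": [{
        "jobname": "seqread",
        "read": {"io_bytes": 1048576, "bw_bytes": 524288, "iops": 128.0},
        "write": {"io_bytes": 0, "bw_bytes": 0, "iops": 0.0},
    }]
})
metrics = parse_fio_json(sample_report)
```

Parsing structured JSON rather than fio's human-readable output avoids breakage when the text layout changes between versions, which is the reason the table above asks to switch the configs to JSON output first.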