micro/stt/README.md
On-device speech-to-text for isolated letters, digits, and command words: a TFLM wrapper around the int8 mel-mode SpellingCNN. Given a normalised log-mel feature plane
(from the feature-generation module) it runs the
int8 classifier with CMSIS-NN kernels and returns dequantized logits,
which the included helpers turn into a labelled prediction.
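
The quantize and dequantize steps around the int8 classifier can be sketched on the host. This assumes the standard TFLite affine int8 mapping, `real = scale * (q - zero_point)`. The function names are hypothetical and not part of `stt.h`, and the real scale and zero point come from the model's input and output tensor parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers for illustration; not the wrapper's actual code.

// real = scale * (q - zero_point)
inline float DequantizeInt8(std::int8_t q, float scale, int zero_point) {
  return scale * static_cast<float>(static_cast<int>(q) - zero_point);
}

// q = clamp(round(real / scale) + zero_point, -128, 127)
inline std::int8_t QuantizeInt8(float real, float scale, int zero_point) {
  assert(scale > 0.0f);
  const long q = std::lround(real / scale) + zero_point;
  return static_cast<std::int8_t>(std::clamp<long>(q, -128, 127));
}
```

Values outside the representable range saturate at -128 or 127 rather than wrapping, which is why feature normalisation upstream matters.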
The checked-in SpellingCNN is a 51-way classifier over isolated spoken
letters, digits, and command words. Class labels (from
examples/rp2350/generated/classes.*) are:
a, b, c, …, z
zero, one, two, three, four, five, six, seven, eight, nine
capital, uppercase, star, dollar, underscore, exclamation, percent, delete, done, cancel, wifi, ip, yes, no, hey rp

Each class is a single hyperarticulated token in a ~1 s window at 16 kHz. The
model supports isolated tokens only, not NATO/ICAO phonetic names,
spelled-out words, or continuous speech.

Replacing the embedded .tflite and classes.* blobs (via
scripts/generate_embedded_data.py) swaps
the label set, but flash and arena sizing must be revalidated for a different
architecture or class count.
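
As a sanity check on the class count, the label table can be rebuilt on the host. This is a sketch only: `BuildLabels` and `LabelFor` are hypothetical names, and the index order here simply follows the order in which this README lists the labels. The generated `classes.*` files remain the authoritative mapping.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative only: 26 letters + 10 digits + 15 words = 51 labels.
// ASSUMPTION: class indices follow the README listing order.
inline std::vector<std::string> BuildLabels() {
  std::vector<std::string> labels;
  for (char c = 'a'; c <= 'z'; ++c) labels.push_back(std::string(1, c));
  for (const char* d : {"zero", "one", "two", "three", "four", "five", "six",
                        "seven", "eight", "nine"}) {
    labels.emplace_back(d);
  }
  for (const char* w : {"capital", "uppercase", "star", "dollar", "underscore",
                        "exclamation", "percent", "delete", "done", "cancel",
                        "wifi", "ip", "yes", "no", "hey rp"}) {
    labels.emplace_back(w);
  }
  return labels;
}

// Maps a predicted class index to its label, with a bounds check.
inline const std::string& LabelFor(const std::vector<std::string>& labels,
                                   int class_index) {
  assert(class_index >= 0 && class_index < static_cast<int>(labels.size()));
  return labels[class_index];
}
```

A swapped-in model with a different vocabulary would change both the table and its size, which is one reason arena and flash sizing need revalidating.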
Custom vocabulary models for other deployments are available commercially from Moonshine AI.
Single public header include/stt/stt.h:

    spelling::Classifier clf(model, model_size, arena, arena_size,
                             n_mels, target_frames, n_classes);
    float* feats = clf.feature_scratch();  // borrowed from the arena overlay
    log_mel.Compute(waveform, n_samples, feats);
    float logits[n_classes];
    clf.Run(feats, logits);                // quantize -> Invoke -> dequantize
    int pred = spelling::Argmax(logits, n_classes);
    float prob = spelling::SoftmaxProb(logits, n_classes, pred);

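
For readers who want to see what the two prediction helpers compute, here is a minimal host-side sketch. It lives in a hypothetical `sketch` namespace and is not the code shipped in `stt.h`. In particular, resolving ties to the lowest index is an assumption about the real `Argmax`.

```cpp
#include <cassert>
#include <cmath>

namespace sketch {  // hypothetical; the real helpers are in namespace spelling

// Index of the largest logit. ASSUMPTION: ties resolve to the lowest index.
inline int Argmax(const float* logits, int n) {
  assert(logits != nullptr && n > 0);
  int best = 0;
  for (int i = 1; i < n; ++i) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

// Softmax probability of class `index`. Subtracting the maximum logit before
// exponentiating keeps exp() from overflowing on large logits.
inline float SoftmaxProb(const float* logits, int n, int index) {
  assert(index >= 0 && index < n);
  const float max_logit = logits[Argmax(logits, n)];
  float denom = 0.0f;
  for (int i = 0; i < n; ++i) denom += std::exp(logits[i] - max_logit);
  return std::exp(logits[index] - max_logit) / denom;
}

}  // namespace sketch
```

Computing only the winning class's probability avoids allocating a second `n_classes` array, which suits a zero-heap target.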
The op set is locked to exactly what the exported model uses (PAD,
DEPTHWISE_CONV_2D, CONV_2D, ADD, SUM, FULLY_CONNECTED, RESHAPE); a
re-export with new ops fails loudly at AllocateTensors().
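
The locked op set behaves like an allowlist. The host-only sketch below illustrates that idea without TFLM. `kAllowedOps` and `FirstUnsupportedOp` are hypothetical names, and on the device the actual check happens inside the interpreter during `AllocateTensors()`, not in code like this.

```cpp
#include <array>
#include <string_view>
#include <vector>

// The seven ops the exported model uses, as listed above.
constexpr std::array<std::string_view, 7> kAllowedOps = {
    "PAD", "DEPTHWISE_CONV_2D", "CONV_2D", "ADD",
    "SUM", "FULLY_CONNECTED", "RESHAPE"};

// Returns the first op that is not in the allowlist, or an empty view if
// every op in `model_ops` is supported.
inline std::string_view FirstUnsupportedOp(
    const std::vector<std::string_view>& model_ops) {
  for (std::string_view op : model_ops) {
    bool found = false;
    for (std::string_view allowed : kAllowedOps) {
      if (op == allowed) {
        found = true;
        break;
      }
    }
    if (!found) return op;
  }
  return {};
}
```

Keeping the list this tight means unused kernels are never linked in, at the cost of having to extend it whenever a re-export introduces a new op.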
| Resource | Size | Notes |
|---|---|---|
| Flash (model) | ~1.3 MiB | int8 SpellingCNN weights (model_data.*) |
| RAM (arena peak) | ~346 KiB | TFLM working set; app provisions 384 KiB |
| RAM (features) | 0 extra | fp32 log-mel written into idle arena overlay |
| Heap | 0 | interpreter + resolver placement-newed into arena head (~1 KiB) |
Feature generation and inference share the same bytes — there is no separate feature buffer.
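
The budget can be checked at compile time. The sketch below assumes the fp32 feature plane has the same 64×128 shape as the model input quoted with the latency figures. The constant names are hypothetical, and the peak figure is the approximate value from the table.

```cpp
#include <cstddef>

constexpr std::size_t kKiB = 1024;

// ASSUMPTION: the feature plane matches the 64x128 model input.
constexpr std::size_t kFeatureElems = 64 * 128;
constexpr std::size_t kFeatureBytes = kFeatureElems * sizeof(float);  // fp32

constexpr std::size_t kArenaBytes = 384 * kKiB;      // provisioned by the app
constexpr std::size_t kArenaPeakBytes = 346 * kKiB;  // approximate TFLM peak
constexpr std::size_t kHeadroomBytes = kArenaBytes - kArenaPeakBytes;

static_assert(kFeatureBytes == 32 * kKiB, "fp32 feature plane is 32 KiB");
static_assert(kFeatureBytes <= kArenaBytes, "features must fit in the arena");
static_assert(kHeadroomBytes == 38 * kKiB, "provisioned headroom is 38 KiB");
```

Under these assumptions the overlay saves a separate 32 KiB buffer, and the 384 KiB provisioning leaves roughly 38 KiB of slack above the measured peak.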
| Operation | Latency | Compute (approx.) | Notes |
|---|---|---|---|
| Classifier::Run() (dual-core) | ~314 ms per 1 s audio | ~36 MMAC per 1 s audio (~36 MMAC/s in) | CMSIS-NN int8 SIMD |
| Classifier::Run() (single-core) | ~507 ms per 1 s audio | ~36 MMAC per 1 s audio (~36 MMAC/s in) | same model, no core split |
MAC count is from the exported SpellingCNN graph structure (64×128 input). See the top-level README for the full pipeline breakdown.
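
The table figures imply an effective throughput and a real-time factor for each configuration. A small sketch of that arithmetic follows. The constant and function names are hypothetical, and the inputs are the approximate values from the table.

```cpp
// Approximate figures copied from the latency table.
constexpr double kMacsPerWindow = 36e6;  // ~36 MMAC per 1 s window
constexpr double kDualCoreSec = 0.314;   // ~314 ms
constexpr double kSingleCoreSec = 0.507; // ~507 ms
constexpr double kAudioSec = 1.0;        // window length

// Sustained MACs per second of wall-clock time, in millions.
constexpr double EffectiveMmacPerSec(double latency_sec) {
  return kMacsPerWindow / latency_sec / 1e6;
}

// Below 1.0 means inference finishes faster than the audio it covers.
constexpr double RealTimeFactor(double latency_sec) {
  return latency_sec / kAudioSec;
}
```

With these inputs the dual-core path sustains about 115 MMAC/s and the single-core path about 71 MMAC/s, a speedup of roughly 1.6×. Both configurations run faster than real time.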
tests/predictor_test.cc (TFLM micro_test.h) covers Argmax (incl. ties) and
the stable softmax. It runs on the host (helper logic only; the interpreter
wrapper is built for the target). scripts/desktop_parity.py reproduces the
on-device embedded-clip test loop on the desktop for regression checks.
scripts/generate_embedded_data.py reads the checked-in
models/spelling_cnn_mel_int8.tflite and its metadata sidecar,
then emits the example's embedded blobs (model_data, classes, mel_tables,
audio_config, test_clips):

    python scripts/generate_embedded_data.py                      # 2 clips/class
    python scripts/generate_embedded_data.py --clips-per-class 1

Output lands in examples/rp2350/generated/ by default. scripts/desktop_parity.py
reproduces the on-device run with ai_edge_litert and diffs per-clip
predictions against a captured pico_monitor.log.