micro/feature-generation/README.md
A portable, heap-free log-mel spectrogram front-end for microcontrollers. It turns 16 kHz PCM into the normalised log-mel features that the VAD and STT models consume, using single-precision kissfft and a precomputed Slaney mel filterbank baked into flash. With hop == n_fft (center=false), the batch and streaming front-ends produce bit-identical normalised log-mel features for the same audio.
The module is dependency-light on purpose — kissfft (real FFT) and TFLM's
micro_log (only for a fatal-misconfig MicroPrintf) — so it can be reused by
any front-end on any platform that provides those two.
| Class | Use | Cost |
|---|---|---|
| LogMelSpectrogram | one-shot ("batch"): a whole clip → (n_mels × target_frames) plane | one FFT per output frame |
| MelStreamer | always-on streaming: slides one non-overlapping n_fft block per call | exactly one FFT per hop |
MelStreamer keeps the last window_frames log-mel columns in a ring and only
FFTs the newest block, so the streaming VAD front-end does roughly
window_frames times less FFT work than recomputing the whole window every hop.
With hop == n_fft (center=false) the two front-ends produce bit-identical
output (see the unit test).
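The ring idea can be sketched as follows. This is a minimal illustration, not the MelStreamer implementation: the class name `ColumnRing`, its methods, and the row-major output layout are assumptions made for the example. Each hop overwrites the oldest column, and the model input is copied out oldest-first.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size ring of log-mel columns (no heap). Illustrative
// only; the real MelStreamer may store and order its data differently.
template <std::size_t kNMels, std::size_t kWindowFrames>
class ColumnRing {
 public:
  // Overwrite the oldest column with the newest one (one call per hop).
  void Push(const float* column) {
    assert(column != nullptr);
    for (std::size_t m = 0; m < kNMels; ++m) {
      data_[head_][m] = column[m];
    }
    head_ = (head_ + 1) % kWindowFrames;
  }

  // Copy the window out oldest-first as an (n_mels x window_frames) plane,
  // row-major: out[m * kWindowFrames + t].
  void CopyWindow(float* out) const {
    assert(out != nullptr);
    for (std::size_t t = 0; t < kWindowFrames; ++t) {
      const std::size_t src = (head_ + t) % kWindowFrames;
      for (std::size_t m = 0; m < kNMels; ++m) {
        out[m * kWindowFrames + t] = data_[src][m];
      }
    }
  }

 private:
  float data_[kWindowFrames][kNMels] = {};  // zero until the ring fills
  std::size_t head_ = 0;                    // slot holding the oldest column
};
```

Only the newest column needs an FFT; the other window_frames - 1 columns are reused from earlier hops.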
Everything is in the single public header
include/feature_generation/feature_generation.h:

    spelling::LogMelParams p{};          // n_fft, hop, n_mels, f_min/f_max, ...
    p.precomputed_window = kMelWindow;   // flash tables (no heap, no boot trig)
    p.precomputed_nz_off = kMelNzOff;    // CSR Slaney filterbank
    // ...
    spelling::LogMelSpectrogram lm(p);
    lm.Compute(waveform, n_samples, out);  // out: n_mels * target_frames floats

    spelling::MelStreamer s(n_mels, window_frames, n_fft, window,
                            nz_off, nz_idx, nz_val, fft);
    s.PushHop(block);         // one FFT
    s.BuildModelInput(out);   // normalised (n_mels × window_frames)

The filterbank is stored only in compact CSR form (nz_off/idx/val). The
dense (n_mels × n_freq) matrix is never materialised, because that single
allocation (~64 KiB at 64 mels and n_fft = 512) would overflow a small MCU heap.
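A minimal sketch of applying a CSR filterbank to one power-spectrum row, assuming the usual CSR convention that nz_off has n_mels + 1 entries and row m owns the non-zeros in [nz_off[m], nz_off[m + 1]). The function name `ApplyCsrFilterbank` and the 16-bit index types are assumptions; the generated tables may use different widths. The static_assert reproduces the dense-matrix size quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Dense filterbank size for n_mels = 64 and n_fft = 512.
constexpr std::size_t kNMels = 64;
constexpr std::size_t kNFreq = 512 / 2 + 1;  // 257 real-FFT bins
constexpr std::size_t kDenseBytes = kNMels * kNFreq * sizeof(float);
static_assert(kDenseBytes == 65792, "64 * 257 * 4 bytes, about 64 KiB");

// Hypothetical CSR mat-vec: mel[m] = sum of nz_val[k] * power[nz_idx[k]]
// over the non-zeros k of row m. Index types are assumed, not confirmed.
inline void ApplyCsrFilterbank(const std::uint16_t* nz_off,
                               const std::uint16_t* nz_idx,
                               const float* nz_val, std::size_t n_mels,
                               const float* power, float* mel) {
  assert(nz_off && nz_idx && nz_val && power && mel);
  for (std::size_t m = 0; m < n_mels; ++m) {
    float acc = 0.0f;
    for (std::size_t k = nz_off[m]; k < nz_off[m + 1]; ++k) {
      acc += nz_val[k] * power[nz_idx[k]];
    }
    mel[m] = acc;
  }
}
```

Because each triangular mel filter touches only a few adjacent bins, the CSR loop visits far fewer entries than the n_mels × n_freq dense product would.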
| Resource | Size | Notes |
|---|---|---|
| Flash | ~26 KiB | precomputed Hann window + CSR Slaney filterbank (mel_tables.*) |
| RAM (static) | ~5 KiB | shared .bss FFT scratch pool (src/fft_scratch.h), not on the stack |
| RAM (streamer) | ≤ 4 KiB | MelStreamer ring: n_mels × window_frames floats |
| Heap | 0 | deployed path uses flash tables only; host-test fallback allocates once |
The FFT scratch (frame_buf[n_fft] + spectrum + power_row) must live off the
4 KiB core stack: a stack-resident 512-point frame (2 KiB for frame_buf alone)
would corrupt the concurrent dual-core GEMM on the RP2350.
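The ~5 KiB figure is consistent with a layout like the one below. This is a sketch under assumptions: the struct, its field types, and the accessor name are hypothetical, and the real pool in src/fft_scratch.h may be organised differently. It assumes n_fft = 512 and 4-byte floats.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kNFft = 512;
constexpr std::size_t kNFreq = kNFft / 2 + 1;  // 257 real-FFT bins

// Stand-in for a single-precision complex bin (two floats, like
// kiss_fft_cpx when kissfft is built with float scalars).
struct ComplexF32 {
  float r;
  float i;
};

// Hypothetical scratch layout, not the actual fft_scratch.h contents.
struct FftScratch {
  float frame_buf[kNFft];       // windowed input frame: 2048 bytes
  ComplexF32 spectrum[kNFreq];  // real-FFT output:      2056 bytes
  float power_row[kNFreq];      // |X[k]|^2 per bin:     1028 bytes
};

static_assert(sizeof(FftScratch) == 5132, "about 5 KiB in total");

// Static storage duration: the object is zero-initialised and lives in
// .bss, so none of it is carved out of a core's stack.
inline FftScratch& SharedFftScratch() {
  static FftScratch scratch;
  return scratch;
}
```

Sharing one pool keeps the static cost fixed, but callers must not use it from both cores at once unless access is serialised.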
Two deployment paths, both at 16 kHz:
| Path | Latency | Compute (approx.) | Notes |
|---|---|---|---|
| VAD: mel for 32 ms of audio | ~0.4 ms per 32 ms audio | ~12 KMAC per 32 ms audio (~0.4 MMAC per second of input) | always-on listening; one 512-sample block |
| STT: mel for 1 s of audio | ~40 ms per 1 s audio | ~1.5 MMAC per 1 s audio (~1.5 MMAC per second of input) | 64×128 log-mel plane after speech ends |
At 250 MHz the VAD path is sub-millisecond per 32 ms of input (dominated by one real FFT). The STT path processes the full 1 s capture in one shot.
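The table figures can be cross-checked with a little arithmetic. The sketch below assumes each STT output frame costs about the same as one VAD block (one 512-point FFT plus the filterbank), which the table does not state explicitly; the constant and function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>

constexpr double kSampleRateHz = 16000.0;
constexpr double kBlockSamples = 512.0;
constexpr double kBlockSeconds = kBlockSamples / kSampleRateHz;  // 32 ms
constexpr double kMacPerFrame = 12e3;  // table figure: ~12 KMAC per block
constexpr double kSttFrames = 128.0;   // 64 x 128 log-mel plane

// Sustained VAD compute rate: 12 KMAC every 32 ms of input.
constexpr double VadMacPerSecond() { return kMacPerFrame / kBlockSeconds; }

// One-shot STT cost, assuming the same per-frame cost as the VAD path.
constexpr double SttMacPerClip() { return kMacPerFrame * kSttFrames; }

// Fraction of real time spent in the front-end for a given latency.
constexpr double VadDutyCycle(double latency_s) {
  return latency_s / kBlockSeconds;
}
```

Under these assumptions the VAD path comes to about 375 KMAC/s (rounded to ~0.4 MMAC/s in the table), the STT clip to about 1.5 MMAC, and a 0.4 ms latency per 32 ms block to a duty cycle of roughly 1.25%.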
tests/feature_generation_test.cc (TFLM micro_test.h) checks the Hann window,
the Slaney mel round-trip, and the streaming-vs-batch parity invariant. It runs
on the host (logic only, no interpreter).
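For reference, a round-trip check on the Slaney mel scale might look like the sketch below. It uses the Slaney-style mapping as implemented in librosa (htk=False): linear below 1 kHz at 200/3 Hz per mel, logarithmic above with a step of ln(6.4)/27. The function names are hypothetical, and whether the unit test uses these exact helpers is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Slaney-style mel scale: linear below 1 kHz, logarithmic above.
constexpr float kHzPerMel = 200.0f / 3.0f;  // slope of the linear region
constexpr float kMinLogHz = 1000.0f;        // start of the log region
constexpr float kMinLogMel = 15.0f;         // kMinLogHz / kHzPerMel

// Log-region step size per mel: ln(6.4) / 27.
inline float LogStep() { return std::log(6.4f) / 27.0f; }

inline float HzToMel(float hz) {
  return hz < kMinLogHz
             ? hz / kHzPerMel
             : kMinLogMel + std::log(hz / kMinLogHz) / LogStep();
}

inline float MelToHz(float mel) {
  return mel < kMinLogMel
             ? mel * kHzPerMel
             : kMinLogHz * std::exp((mel - kMinLogMel) * LogStep());
}
```

A round-trip test then asserts that MelToHz(HzToMel(f)) returns f to within a small relative tolerance across the band, for example from f_min = 20 Hz to f_max = 8000 Hz.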
scripts/generate_mel_tables.py emits mel_tables.{h,cc} for a given front-end
config. The tables depend only on the front-end parameters, not on the model:

    python scripts/generate_mel_tables.py \
      --sample-rate 16000 --n-fft 512 --win-length 512 \
      --n-mels 64 --f-min 20 --f-max 8000 \
      --prefix kMel --const-prefix kMelTable --basename mel_tables \
      --out-dir ../examples/rp2350/generated