micro/vad/README.md
On-device voice activity detection: an int8 TinyVadCNN plus a moving-average smoother and 1 s clip
extractor. In the always-on path it runs once per ~32 ms hop, turning audio into
speech-segment boundaries that the STT classifier then transcribes.
The module is split into two reusable pieces:
| Class | Role |
|---|---|
| Vad | TFLM wrapper: (n_mels × window_frames) log-mel window → one speech probability in [0, 1] |
| VadSegmenter | smoothing + segmentation: per-frame probabilities → segment [start, end) sample indices, storing no audio |
The streaming log-mel front-end that feeds Vad is the
feature-generation module's MelStreamer (one FFT per
hop). The application composes MelStreamer → Vad → VadSegmenter; see
examples/rp2350's audio path. Dependencies are only feature-generation and
TFLM, so the VAD can drop into a different example on another platform unchanged.
Single public header include/vad/vad.h:

    spelling::Vad vad(model, model_size, arena, arena_size, n_mels, window_frames);
    float* feats = vad.feature_scratch();
    streamer.BuildModelInput(feats);
    float p = vad.Predict(feats);  // speech probability

    spelling::VadSegmenter seg(threshold, smooth_frames, hop,
                               look_behind_samples, max_segment_samples);
    seg.Start();
    if (seg.ProcessFrame(p) == spelling::VadEvent::kSpeechEnd) {
      spelling::ExtractClipFrontAligned(audio, n, seg.segment_start_sample(),
                                        seg.segment_end_sample(), clip, clip_len);
    }

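To illustrate what a front-aligned extraction can look like, here is a minimal sketch. It is not the module's `ExtractClipFrontAligned`: the helper name `CopyFrontAligned`, the `int16_t` sample type, and the truncate-then-zero-pad behaviour are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the module's API). Copies the segment
// [start, end) from `audio` to the front of `clip`, truncates it to
// clip_len samples, and zero-pads the remainder of the clip.
// Returns the number of samples actually copied.
inline std::size_t CopyFrontAligned(const int16_t* audio, std::size_t n,
                                    std::size_t start, std::size_t end,
                                    int16_t* clip, std::size_t clip_len) {
  assert(audio != nullptr && clip != nullptr);
  end = std::min(end, n);        // clamp the segment to the buffer
  start = std::min(start, end);  // guard against an inverted range
  const std::size_t count = std::min(end - start, clip_len);
  std::copy(audio + start, audio + start + count, clip);
  std::fill(clip + count, clip + clip_len, int16_t{0});  // pad the tail
  return count;
}
```

Under these assumptions a short segment ends up at index 0 of the clip followed by silence, so the classifier always sees speech onset at the same position.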
The Vad output logit is accepted as int8 (pure-int8 model) or int16 (the
mixed-precision int8-body/int16-head export); Predict() dequantizes both and
applies the sigmoid.
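The dequantize-then-sigmoid step can be sketched as follows. This is an illustration under assumptions, not the code inside `Predict()`: the helper names are hypothetical, and `scale` and `zero_point` stand in for the output tensor's affine quantization parameters.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: affine dequantization, logit = scale * (q - zero_point),
// followed by a logistic sigmoid to obtain a probability in [0, 1].
inline float LogitToProbability(int32_t q, float scale, int32_t zero_point) {
  assert(scale > 0.0f);
  const float logit = scale * static_cast<float>(q - zero_point);
  return 1.0f / (1.0f + std::exp(-logit));
}

// The int8 and int16 outputs differ only in storage width; both widen to
// int32 before dequantization, so one code path can serve both exports.
inline float ProbabilityFromInt8(int8_t q, float scale, int32_t zero_point) {
  return LogitToProbability(q, scale, zero_point);
}

inline float ProbabilityFromInt16(int16_t q, float scale, int32_t zero_point) {
  return LogitToProbability(q, scale, zero_point);
}
```

A quantized value equal to the zero point maps to a logit of 0 and therefore to a probability of 0.5.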
| Resource | Size | Notes |
|---|---|---|
| Flash (model) | 64 KiB | int8 TinyVadCNN weights (vad_model_data.*) |
| Flash (front-end) | ~25 KiB | VAD mel tables (vad_mel_tables.*) |
| RAM (arena peak) | ~36 KiB | TFLM working set; shares the STT arena sequentially |
| RAM (segmenter) | ~0.3 KiB | smoothing ring + counters; no audio buffer |
| Heap | 0 | no dynamic allocation |
VAD and STT never call Invoke() at the same time, so the VAD's arena use adds no static
SRAM beyond the shared 384 KiB tensor arena provisioned by the app.
| Operation | Latency | Compute (approx.) | Notes |
|---|---|---|---|
| Vad::Predict() | ~3.1 ms per 32 ms audio | ~0.8 MMAC per 32 ms audio (~25 MMAC/s in) | int8 TinyVadCNN Invoke |
| VAD step (mel + infer) | ~3.5 ms per 32 ms audio | ~0.8 MMAC per 32 ms audio (~25 MMAC/s in) | streaming mel + Predict |
One Invoke() per 32 ms of audio at ~3.1 ms each leaves roughly 10× headroom at 250 MHz
(about 9× for the full mel + infer step). The segmenter is a moving average, O(window) per frame.
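As a rough illustration of moving-average smoothing plus threshold segmentation, here is a minimal sketch. It is not `VadSegmenter`: the class and enum names are hypothetical, the maximum-segment-length cutoff is omitted, and the choices to average over only the frames seen so far and to end a segment at the start of the first below-threshold frame are assumptions.

```cpp
#include <cassert>
#include <cstddef>

enum class SketchEvent { kNone, kSpeechStart, kSpeechEnd };

// Hypothetical segmenter: smooths per-frame probabilities with a moving
// average and reports segment boundaries as sample indices. No audio is stored.
class SketchSegmenter {
 public:
  static constexpr std::size_t kMaxWindow = 16;

  SketchSegmenter(float threshold, std::size_t window, std::size_t hop,
                  std::size_t look_behind)
      : threshold_(threshold), window_(window), hop_(hop),
        look_behind_(look_behind) {
    assert(window_ >= 1 && window_ <= kMaxWindow);
  }

  SketchEvent ProcessFrame(float p) {
    ring_[pos_] = p;  // overwrite the oldest probability
    pos_ = (pos_ + 1) % window_;
    if (filled_ < window_) ++filled_;

    float sum = 0.0f;  // O(window) re-sum over the ring on every frame
    for (std::size_t i = 0; i < filled_; ++i) sum += ring_[i];
    const float mean = sum / static_cast<float>(filled_);

    const std::size_t frame_start = frames_ * hop_;  // first sample of this frame
    ++frames_;

    if (!in_speech_ && mean >= threshold_) {
      in_speech_ = true;
      // The look-behind pre-roll pulls the start earlier, clamped at sample 0.
      start_ = frame_start > look_behind_ ? frame_start - look_behind_ : 0;
      return SketchEvent::kSpeechStart;
    }
    if (in_speech_ && mean < threshold_) {
      in_speech_ = false;
      end_ = frame_start;  // exclusive end of the segment
      return SketchEvent::kSpeechEnd;
    }
    return SketchEvent::kNone;
  }

  std::size_t start() const { return start_; }
  std::size_t end() const { return end_; }

 private:
  float threshold_;
  std::size_t window_;
  std::size_t hop_;
  std::size_t look_behind_;
  float ring_[kMaxWindow] = {};
  std::size_t pos_ = 0;
  std::size_t filled_ = 0;
  std::size_t frames_ = 0;
  std::size_t start_ = 0;
  std::size_t end_ = 0;
  bool in_speech_ = false;
};
```

The state is one small ring of floats plus a few counters, which fits the sub-kilobyte segmenter footprint listed in the resource table.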
tests/vad_segmenter_test.cc (TFLM micro_test.h) covers segment detection,
the look-behind pre-roll, trailing-segment flush, and front-aligned clip
extraction. It runs on the host (segmenter logic only; the interpreter wrapper
is built for the target).
scripts/generate_vad_embedded_data.py emits vad_config.h,
vad_mel_tables.{h,cc} (front-end, model-independent) and, given a model,
vad_model_data.{h,cc}:

    # config + mel tables only (no model file needed)
    python scripts/generate_vad_embedded_data.py --config-only

    # also embed the checked-in int8 model
    python scripts/generate_vad_embedded_data.py \
        --tflite ../models/tinyvad_cnn_speech_mel_head16.tflite

Output lands in examples/rp2350/generated/ by default.