Back to Moonshine

Voice Activity Detection

micro/vad/README.md

0.0.633.7 KB
Original Source

Voice Activity Detection

On-device voice activity detection: an int8 TinyVadCNN plus a moving-average smoother and 1 s clip extractor. In the always-on path it runs once per ~32 ms hop, turning audio into speech-segment boundaries that the STT classifier then transcribes.

The module is split into two reusable pieces:

ClassRole
VadTFLM wrapper: (n_mels × window_frames) log-mel window → one speech probability in [0, 1]
VadSegmentersmoothing + segmentation: per-frame probabilities → segment [start, end) sample indices, storing no audio

The streaming log-mel front-end that feeds Vad is the feature-generation module's MelStreamer (one FFT per hop). The application composes MelStreamer → Vad → VadSegmenter; see examples/rp2350's audio path. Dependencies are only feature-generation and TFLM, so the VAD can drop into a different example on another platform unchanged.

<!--TOC-->

Public API

Single public header include/vad/vad.h:

cpp
spelling::Vad vad(model, model_size, arena, arena_size, n_mels, window_frames);
float* feats = vad.feature_scratch();
streamer.BuildModelInput(feats);
float p = vad.Predict(feats);            // speech probability

spelling::VadSegmenter seg(threshold, smooth_frames, hop,
                           look_behind_samples, max_segment_samples);
seg.Start();
if (seg.ProcessFrame(p) == spelling::VadEvent::kSpeechEnd) {
  spelling::ExtractClipFrontAligned(audio, n, seg.segment_start_sample(),
                                    seg.segment_end_sample(), clip, clip_len);
}

The Vad output logit is accepted as int8 (pure-int8 model) or int16 (the mixed-precision int8-body/int16-head export); Predict() dequantizes both and applies the sigmoid.

Memory & compute

ResourceSizeNotes
Flash (model)64 KiBint8 TinyVadCNN weights (vad_model_data.*)
Flash (front-end)~25 KiBVAD mel tables (vad_mel_tables.*)
RAM (arena peak)~36 KiBTFLM working set; shares the STT arena sequentially
RAM (segmenter)~0.3 KiBsmoothing ring + counters; no audio buffer
Heap0no dynamic allocation

VAD and STT never Invoke() at once, so the VAD arena use adds no extra static SRAM beyond the shared 384 KiB tensor arena provisioned by the app.

Latency @ 250 MHz

OperationLatencyCompute (approx.)Notes
Vad::Predict()~3.1 ms per 32 ms audio~0.8 MMAC per 32 ms audio (~25 MMAC/s in)int8 TinyVadCNN Invoke
VAD step (mel + infer)~3.5 ms per 32 ms audio~0.8 MMAC per 32 ms audio (~25 MMAC/s in)streaming mel + Predict

One Invoke() per 32 ms of audio — ~11× headroom at 250 MHz. The segmenter is a moving average (O(window)).

Tests

tests/vad_segmenter_test.cc (TFLM micro_test.h) covers segment detection, the look-behind pre-roll, trailing-segment flush, and front-aligned clip extraction. It runs on the host (segmenter logic only; the interpreter wrapper is built for the target).

Generating data

scripts/generate_vad_embedded_data.py emits vad_config.h, vad_mel_tables.{h,cc} (front-end, model-independent) and, given a model, vad_model_data.{h,cc}:

bash
# config + mel tables only (no model file needed)
python scripts/generate_vad_embedded_data.py --config-only

# also embed the checked-in int8 model
python scripts/generate_vad_embedded_data.py \
    --tflite ../models/tinyvad_cnn_speech_mel_head16.tflite

Output lands in examples/rp2350/generated/ by default.