Back to Moonshine

Models

micro/models/README.md

0.0.636.3 KB
Original Source

Models

Pre-exported SpellingCNN and TinyVadCNN artifacts for the on-device spelling pipeline. The copies here use canonical, human-readable names for desktop inspection and re-embedding into firmware.

<!--TOC-->

System diagram

Both models consume normalised log-mel features computed separately (16 kHz PCM → Slaney filterbank → per-clip or per-hop normalisation). Neither .tflite / .onnx file includes the mel front-end — only the classifier head.

mermaid
flowchart LR
  subgraph listen["Always-on (32 ms hops)"]
    PCM1["16 kHz PCM"] --> MelS["Log-mel stream"]
    MelS --> VAD["TinyVadCNN"]
    VAD --> Seg["Segmenter"]
  end

  subgraph classify["On speech end (~1 s clip)"]
    Seg --> Clip["1 s int16 clip"]
    Clip --> MelB["Log-mel batch"]
    MelB --> STT["SpellingCNN"]
    STT --> Logits["51 logits"]
    Logits --> TTS["TTS"]
  end

  TTS --> Spk["spoken reply"]

SpellingCNN — letters, digits, and command words

51-way classifier for hyperarticulated isolated spoken tokens in a ~1 s window: the 26 letters az, the ten digit words zeronine, and 15 command/symbol words used by the voice WiFi-setup flow (capital, uppercase, star, dollar, underscore, exclamation, percent, delete, done, cancel, wifi, ip, yes, no, hey rp). Limited to isolated tokens — not NATO names, running speech, or multi-character strings.

Provenance

FieldValue
Source runrun_2026_06_24_19_08 (best epoch 45)
Training datalocal TTS wavs (ElevenLabs / Google / Polly) + People's Speech real-voice clips (packed parquet on the Hub), class-balanced sampling (--sampling-power 1.0)
Speaker-independent val98.47% top-1, 97.16% macro, 97.00% E-set
Held-out captured (220 real clips)90.91% top-1, 85.96% E-set

The int8 export is accuracy-neutral vs the fp32 PyTorch model on captured (90.91% / 85.96%, argmax parity on the export check). Supersedes the previous run_2026_06_18_18_43 export (which scored 89.55% / 85.96% on the same set).

Architecture

MobileNetV2-style inverted-residual CNN (model_size=1, width multiplier 1.0, ~1.02M params):

StageBlocksNotes
Stem3×3 conv, stride (2, 2)downsamples freq and time
Backbone7 IR stages: (1,16,1), (4,24,2), (4,32,3), (4,64,3), (4,96,2), (4,160,2), (4,240,1)expand · out-channels · repeats
Head1×1 conv (640) → global average pool → linear51 logits

pad_to_odd is enabled, so every stride-2 conv sees an odd input: PyTorch's symmetric padding=1 then matches TFLite SAME and folds into the conv, avoiding the standalone PAD op that would otherwise set the TFLM arena peak.

TFLM op set: PAD, DEPTHWISE_CONV_2D, CONV_2D, ADD, SUM, FULLY_CONNECTED, RESHAPE.

Per-inference compute: ~36 MMAC per forward pass on a (1, 1, 64, 128) log-mel input (int8 LiteRT estimate).

Files

FileFormatRole
spelling_cnn_mel_int8.tfliteint8 LiteRTprimary int8 deploy artifact, embedded into firmware (~1.28 MiB)
spelling_cnn_mel_fp32.tflitefp32 LiteRTreference / re-quantization
spelling_cnn_mel.onnxfp32 ONNXdesktop inspection
spelling_cnn_mel.shrunk.onnxint8 ONNXcompact desktop deploy
spelling_cnn_meta.jsonJSONclasses, audio dims

The spelling_cnn_letters_digits_mel* files are the superseded 36-class (letters + digits only) export, kept for reference; the embed/resolve tooling prefers the spelling_cnn_mel* names above.

Inputs and outputs

TensorShapeDtypeDescription
log_mel (in)(B, 1, 64, 128)fp32 (ONNX) / int8 (TFLM)per-clip normalised log-mel: 64 mels × 128 frames, 16 kHz, hop 125 (~1 s)
logits (out)(B, 51)fp32unnormalised class scores; apply softmax for probabilities

Class order is in spelling_cnn_meta.json (classes array).

TinyVadCNN — voice activity detection

Runs once per 32 ms hop while listening; a moving-average segmenter turns frame probabilities into speech boundaries.

Architecture

Compact MobileNetV2-style binary classifier (TinyVadCNN):

StageBlocksNotes
Stem3×3 conv, stride (2, 1)~20.6K params total
Backbone4 IR stages: (1,16,1), (3,24,1), (3,32,1), (3,48,1)freq-downsample, short time axis
Head1×1 conv → GAP → linear1 speech logit

The primary export uses head16 mixed precision: int8 conv body, int16 head_conv / pool / output (recovers accuracy at near-int8 size).

Per-inference compute: ~0.78 MMAC per forward pass on a (1, 1, 32, 16) log-mel window (781 KMAC).

Files

FileFormatRole
tinyvad_cnn_speech_mel_head16.tfliteint8/int16 LiteRTprimary int8 deploy artifact (64 KiB)
tinyvad_cnn_speech_mel_fp32.tflitefp32 LiteRTreference / re-quantization
tinyvad_cnn_speech_mel.onnxfp32 ONNXdesktop inspection
tinyvad_cnn_speech_meta.jsonJSONmel dims, hop

Inputs and outputs

TensorShapeDtypeDescription
log_mel (in)(B, 1, 32, 16)fp32 (ONNX) / int8 (TFLM)streaming normalised log-mel window: 32 mels × 16 frames
speech_logit (out)(B, 1)fp32 / int16raw logit; sigmoid(logit) → speech probability in [0, 1]

Streaming audio config (from tinyvad_cnn_speech_meta.json): 16 kHz, n_fft=512, hop 512 samples (32 ms), f_min=20, f_max=8000.