Pure Go voice activity detection using the Silero VAD neural network.
No CGo. No ONNX runtime. No external dependencies.
The model weights are embedded in the binary.

Install with `go get`:

    go get github.com/zserge/govad@latest

Basic usage:

    package main

    import (
    	"fmt"

    	"github.com/zserge/govad"
    )

    func main() {
    	// Create a VAD detector (uses embedded weights)
    	v, err := govad.New()
    	if err != nil {
    		panic(err)
    	}

    	// Feed 512 float32 samples at 16 kHz per call
    	samples := make([]float32, govad.SamplesPerFrame)
    	// ... fill samples from your audio source ...

    	prob := v.Process(samples)
    	if prob > 0.5 {
    		fmt.Println("Speech detected!")
    	}

    	// Call Reset() between unrelated audio streams
    	v.Reset()
    }
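
How to fill `samples` depends on the audio source. As a minimal sketch, assume the source delivers 16-bit signed mono PCM at 16 kHz and that the detector expects samples normalized to roughly [-1, 1]. That is the usual Silero VAD convention, but this README does not confirm it. Under those assumptions, the stdlib-only helper below splits a buffer into 512-sample frames. The names `pcm16ToFrames` and `frameSize` are hypothetical; in real code `frameSize` would be `govad.SamplesPerFrame`.

```go
package main

import "fmt"

// frameSize mirrors the 512-sample frame described above; in real code
// use govad.SamplesPerFrame instead of this local constant.
const frameSize = 512

// pcm16ToFrames converts 16-bit signed PCM into float32 frames of
// frameSize samples, scaled to [-1, 1) (assumed input range).
// A trailing partial frame is dropped in this sketch.
func pcm16ToFrames(pcm []int16) [][]float32 {
	frames := make([][]float32, 0, len(pcm)/frameSize)
	for start := 0; start+frameSize <= len(pcm); start += frameSize {
		frame := make([]float32, frameSize)
		for i, s := range pcm[start : start+frameSize] {
			frame[i] = float32(s) / 32768.0
		}
		frames = append(frames, frame)
	}
	return frames
}

func main() {
	pcm := make([]int16, 1300) // 2 full frames plus 276 leftover samples
	pcm[0] = 16384
	frames := pcm16ToFrames(pcm)
	fmt.Println(len(frames), len(frames[0]), frames[0][0]) // 2 512 0.5
}
```

Each returned frame could then be passed to `v.Process` in order. A real capture loop would carry the leftover samples into the next read instead of discarding them.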
The `examples/live-vad` directory contains a complete real-time VAD demo using malgo (miniaudio bindings):

    cd examples/live-vad
    go run . -threshold 0.5

It captures audio from your default microphone and prints speech/silence transitions in real time.
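
The demo's own transition logic is not reproduced here. As a rough sketch of how per-frame probabilities can be turned into speech/silence transitions, the following uses a single threshold plus a hangover counter, so that brief dips do not end a speech segment. The `segmenter` type, its fields, and the hangover value are hypothetical and are not taken from `examples/live-vad`.

```go
package main

import "fmt"

// segmenter tracks whether the stream is currently inside speech.
type segmenter struct {
	threshold float32 // probability above which a frame counts as speech
	hangover  int     // consecutive non-speech frames required to end speech
	speaking  bool
	quiet     int // non-speech frames seen since the last speech frame
}

// update consumes one probability and returns "speech" or "silence"
// when the state changes, or "" when it does not.
func (s *segmenter) update(prob float32) string {
	if prob > s.threshold {
		s.quiet = 0
		if !s.speaking {
			s.speaking = true
			return "speech"
		}
		return ""
	}
	if s.speaking {
		s.quiet++
		if s.quiet >= s.hangover {
			s.speaking = false
			s.quiet = 0
			return "silence"
		}
	}
	return ""
}

func main() {
	s := &segmenter{threshold: 0.5, hangover: 3}
	probs := []float32{0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.1, 0.1, 0.1}
	for i, p := range probs {
		if ev := s.update(p); ev != "" {
			fmt.Printf("frame %d: %s\n", i, ev)
		}
	}
}
```

With these inputs the single low frame at index 3 does not end the segment; silence is reported only after three consecutive low frames.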
| Function | Description |
|---|---|
| `govad.New()` | Create a detector with embedded default weights |
| `govad.NewFromFile(path)` | Load weights from a file |
| `govad.NewFromReader(r)` | Load weights from an `io.Reader` |
| `v.Process(samples)` | Run inference on 512 samples, returns probability [0, 1] |
| `v.Reset()` | Clear LSTM state for a new audio stream |
On Apple M1:

    BenchmarkProcess-8    1911    632370 ns/op    10112 B/op    7 allocs/op

That is about 632 µs per 32 ms frame, roughly 50× faster than real time.
The weights are exported from `silero_vad_half.onnx` (Silero VAD v5, 16 kHz only).
The architecture is:

    Audio (512 samples, 16 kHz)
      → Reflect pad (64 right)
      → Conv-STFT (n_fft=256, hop=128)
      → Magnitude spectrum
      → Conv1d(129→128, k=3) + ReLU
      → Conv1d(128→64, k=3, stride=2) + ReLU
      → Conv1d(64→64, k=3, stride=2) + ReLU
      → Conv1d(64→128, k=3) + ReLU
      → LSTMCell(128)
      → ReLU → Linear(128→1) → Sigmoid
      → Speech probability
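
To see how a single frame can collapse to one LSTM input, the time dimension can be traced through the layers. The sketch below assumes an uncentered STFT (frame count `(n - n_fft)/hop + 1`) and a padding of 1 on every `Conv1d`. Neither detail is stated above, so treat the numbers as illustrative and not as a description of the actual implementation. The 129 input channels of the first convolution do match the `n_fft/2 + 1 = 129` frequency bins of a 256-point transform.

```go
package main

import "fmt"

// stftFrames counts STFT frames without centering (assumption).
func stftFrames(n, nFFT, hop int) int {
	return (n-nFFT)/hop + 1
}

// convOutLen is the standard Conv1d output-length formula.
func convOutLen(n, kernel, stride, pad int) int {
	return (n+2*pad-kernel)/stride + 1
}

// timeSteps traces the time dimension through the stack listed above,
// assuming padding=1 for every convolution (not confirmed by the README).
func timeSteps(samples int) []int {
	n := samples + 64 // reflect pad, 64 samples on the right
	steps := []int{stftFrames(n, 256, 128)}
	for _, stride := range []int{1, 2, 2, 1} {
		last := steps[len(steps)-1]
		steps = append(steps, convOutLen(last, 3, stride, 1))
	}
	return steps
}

func main() {
	fmt.Println("frequency bins:", 256/2+1)                // 129
	fmt.Println("time steps per layer:", timeSteps(512)) // [3 3 2 1 1]
}
```

Under these assumptions the last convolution emits a single 128-channel vector per frame, which is the shape an `LSTMCell(128)` consumes in one step.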
The Go code is MIT licensed. The model weights are from Silero VAD, also MIT licensed.