speaker-recognition

Speaker (voice) recognition backend for LocalAI. It is the audio analog to insightface: it produces speaker embeddings and supports 1:1 voice verification and voice analysis (emotion, plus optional age and gender).

Engines

SpeechBrainEngine (default): ECAPA-TDNN trained on VoxCeleb. 192-d L2-normalised embeddings, cosine distance for verification. Auto-downloads from HuggingFace on first LoadModel.
OnnxDirectEngine: Any pre-exported ONNX speaker encoder (WeSpeaker ResNet, 3D-Speaker ERes2Net, CAM++, …). Model path comes from the gallery files: entry.
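
As a rough illustration of how cosine distance can drive a 1:1 check, here is a minimal pure-Python sketch. The function names and the 0.25 threshold are hypothetical and not taken from the backend; the real decision threshold depends on the model and should be tuned on held-out data.

```python
import math
from typing import Sequence


def l2_normalise(vec: Sequence[float]) -> list:
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means identical direction."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    a_n, b_n = l2_normalise(a), l2_normalise(b)
    return 1.0 - sum(x * y for x, y in zip(a_n, b_n))


def same_speaker(a: Sequence[float], b: Sequence[float],
                 threshold: float = 0.25) -> bool:
    """Hypothetical decision rule: accept when the distance is small enough.

    The default threshold is an illustrative assumption, not a backend value.
    """
    return cosine_distance(a, b) <= threshold
```

When embeddings are already L2-normalised, as with the SpeechBrain engine, the normalisation step is a no-op and the distance reduces to one minus the dot product.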

Engine selection is gallery-driven: if the model config provides model_path: or onnx:, the ONNX engine is used; otherwise the SpeechBrain engine is used.
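
The selection rule can be sketched as a small function. This assumes the model config has already been parsed into a flat dict, which is an assumption about its shape; `select_engine` is a hypothetical helper, not the backend's actual code.

```python
def select_engine(model_config: dict) -> str:
    """Pick an engine name from a parsed model config.

    Assumes a flat dict; the keys mirror the model_path: / onnx: options
    named in the text.
    """
    # A non-empty ONNX path in either key selects the ONNX engine.
    if model_config.get("model_path") or model_config.get("onnx"):
        return "OnnxDirectEngine"
    # Otherwise fall back to the default SpeechBrain engine.
    return "SpeechBrainEngine"
```

For example, `select_engine({"onnx": "encoder.onnx"})` picks the ONNX engine, while an empty config falls back to the SpeechBrain default.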

Endpoints

POST /v1/voice/verify — 1:1 same-speaker check.
POST /v1/voice/embed — extract a speaker embedding vector.
POST /v1/voice/analyze — voice demographics, loaded lazily on the first analyze call:
- Emotion (default, opt-out): superb/wav2vec2-base-superb-er (Apache-2.0), 4-way categorical (neutral / happy / angry / sad).
- Age + gender (opt-in): no default — wire a checkpoint with a standard Wav2Vec2ForSequenceClassification head via age_gender_model:<repo> in options. The Audeering age-gender model is not usable as a drop-in because its multi-task head isn't loadable via AutoModelForAudioClassification.
Both heads are optional. If neither head loads, the engine returns 501 (Not Implemented).
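
The lazy-loading behaviour described above might look roughly like the following sketch. `LazyHeads`, its loader callables, and the `(status, body)` return shape are all hypothetical names chosen for illustration. The real backend speaks gRPC, and how it signals the 501 is not shown here.

```python
from typing import Callable, Dict, Optional, Tuple


class LazyHeads:
    """Load optional analysis heads on the first analyze call only."""

    def __init__(self, loaders: Dict[str, Callable[[], Callable[[str], object]]]):
        # Each loader returns a callable head that maps a wav path to a result.
        self._loaders = loaders
        self._heads: Optional[Dict[str, Callable[[str], object]]] = None

    def _ensure_loaded(self) -> Dict[str, Callable[[str], object]]:
        if self._heads is None:
            heads = {}
            for name, loader in self._loaders.items():
                try:
                    heads[name] = loader()
                except Exception:
                    # A head that fails to load is skipped; heads are optional.
                    continue
            self._heads = heads
        return self._heads

    def analyze(self, wav_path: str) -> Tuple[int, dict]:
        heads = self._ensure_loaded()
        if not heads:
            # Nothing loaded: report "not implemented".
            return 501, {"error": "no analysis head available"}
        return 200, {name: head(wav_path) for name, head in heads.items()}
```

Constructing the object does no loading, so models that never receive an analyze request pay no start-up cost for the heads.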

Audio input

The HTTP layer materialises audio to a temporary WAV file before calling the gRPC backend. It accepts three input forms: a URL, a data URI, or raw base64. The backend itself always receives a filesystem path.
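
To make the three input forms concrete, here is a minimal Python sketch of the materialisation step. It is an illustration only, not LocalAI's HTTP layer: `materialise_audio` is a hypothetical helper, it does not convert or validate the audio format, and it leaves temp-file cleanup to the caller.

```python
import base64
import os
import tempfile
import urllib.request


def materialise_audio(source: str) -> str:
    """Write audio given as a URL, data URI, or raw base64 to a temp .wav path."""
    if source.startswith(("http://", "https://")):
        # URL: download the bytes.
        with urllib.request.urlopen(source) as resp:
            data = resp.read()
    elif source.startswith("data:"):
        # Data URI: "data:<mime>;base64,<payload>".
        header, _, payload = source.partition(",")
        if not header.endswith(";base64"):
            raise ValueError("only base64 data URIs are handled in this sketch")
        data = base64.b64decode(payload)
    else:
        # Anything else is treated as raw base64.
        data = base64.b64decode(source, validate=True)

    fd, path = tempfile.mkstemp(suffix=".wav")
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path
```

The returned path is what a backend call would receive; the caller is expected to delete the file once the request completes.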