docs/content/features/audio-classification.md
+++
disableToc = false
title = "Sound Classification"
weight = 18
url = "/features/audio-classification/"
+++
Sound-event classification (audio tagging) answers the question "what am I hearing?" Given an audio clip, it returns a list of scored AudioSet labels (e.g. "Baby cry, infant cry", "Glass breaking", "Dog bark", "Alarm").
LocalAI exposes this through the /v1/audio/classification endpoint, modelled after /v1/audio/transcriptions. The reference backend is ced.cpp, a ggml port of CED (a 527-class AudioSet tagger built as a small ViT over a log-mel spectrogram) with full parity against the PyTorch implementation. The weights are Apache-2.0 licensed and redistributable as GGUF.
Because classification is exposed as a regular OpenAI-style endpoint, any HTTP client works; there is no Python dependency on the consumer side.
POST /v1/audio/classification
Content-Type: multipart/form-data
| Field | Type | Description |
|---|---|---|
| file | file (required) | Audio file in any format ffmpeg accepts |
| model | string (required) | Name of the sound-classification-capable model (e.g. ced-base) |
| top_k | int | Number of top tags to return (0 = backend default) |
| threshold | float | Drop tags scoring below this value |
{
  "model": "ced-base",
  "detections": [
    {"index": 23, "label": "Baby cry, infant cry", "score": 0.87},
    {"index": 22, "label": "Crying, sobbing", "score": 0.41}
  ]
}
Detections are returned in score-descending order. Scores are per-class probabilities (multi-label, independent), so they do not sum to 1.
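
To make the multi-label point concrete, here is a minimal sketch that consumes the example response shown above. The helper names (`labels_above`, `is_score_descending`) and the 0.5 cut-off are illustrative assumptions, not part of LocalAI; Python is used only because it is convenient for parsing JSON.

```python
import json

# Sample payload copied from the example response in this page.
SAMPLE = """
{
  "model": "ced-base",
  "detections": [
    {"index": 23, "label": "Baby cry, infant cry", "score": 0.87},
    {"index": 22, "label": "Crying, sobbing", "score": 0.41}
  ]
}
"""


def labels_above(response, min_score):
    """Return (label, score) pairs whose score is at least min_score.

    Each score is judged on its own: with independent per-class
    probabilities, several classes can be likely at the same time.
    """
    return [
        (d["label"], d["score"])
        for d in response["detections"]
        if d["score"] >= min_score
    ]


def is_score_descending(response):
    """Check the documented ordering: highest score first."""
    scores = [d["score"] for d in response["detections"]]
    return all(a >= b for a, b in zip(scores, scores[1:]))


response = json.loads(SAMPLE)

# Hypothetical client-side cut-off; 0.5 is an arbitrary example value.
confident = labels_above(response, 0.5)

# The scores add up to roughly 1.28 here, which is fine for multi-label output.
total = sum(d["score"] for d in response["detections"])
```

In this sample only "Baby cry, infant cry" clears the 0.5 cut-off, while the total exceeds 1. A client should therefore compare each score against a threshold rather than treat the list as a single probability distribution.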
curl http://localhost:8080/v1/audio/classification \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/clip.wav" \
  -F model="ced-base" \
  -F top_k=10
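
The same request can be sent from any language that can build a multipart form. The following is a rough sketch using only the Python standard library, as one example of a plain HTTP client. The function names `encode_multipart` and `classify` are hypothetical helpers written for this page, the base URL assumes the local instance from the curl example, and the filename is inserted into the header without escaping, so treat it as a starting point rather than a hardened client.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one `file` part as multipart/form-data.

    Returns (body, content_type). The content type carries the boundary
    string that the server uses to split the body into parts.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            f"\r\n"
            f"{value}\r\n"
        ).encode("utf-8"))
    # The audio goes into the part named `file`, as listed in the field table.
    chunks.append((
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n"
        f"\r\n"
    ).encode("utf-8"))
    chunks.append(file_bytes)
    chunks.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def classify(path, model, base_url="http://localhost:8080",
             top_k=None, threshold=None):
    """POST an audio file to /v1/audio/classification and return the JSON."""
    fields = {"model": model}
    # Optional fields are only sent when the caller sets them.
    if top_k is not None:
        fields["top_k"] = top_k
    if threshold is not None:
        fields["threshold"] = threshold
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = encode_multipart(fields, os.path.basename(path), audio)
    request = urllib.request.Request(
        base_url + "/v1/audio/classification",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (needs a running LocalAI instance with the model available):
# result = classify("/path/to/clip.wav", "ced-base", top_k=10)
# for d in result["detections"]:
#     print(d["score"], d["label"])
```

Note that the `Content-Type` header has to include the boundary generated for the body, which is why the helper returns both together.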