docs/doc/developer/backend/transcription.mdx
Omi's transcription system provides real-time speech-to-text conversion with speaker identification, multiple language support, and seamless integration with the conversation processing pipeline.
```mermaid
flowchart LR
    subgraph Client["📱 Omi App"]
        Audio[Audio Capture]
    end
    subgraph Backend["🖥️ Backend"]
        WS["/v4/listen<br/>WebSocket"]
        Decode[Audio Decoder]
    end
    subgraph STT["🎧 STT Providers"]
        DG[Deepgram Nova-3]
        Soniox[Soniox]
        SM[Speechmatics]
    end
    Audio -->|Binary stream| WS
    WS --> Decode
    Decode --> DG
    Decode -.->|Fallback| Soniox
    Decode -.->|Fallback| SM
    DG -->|Transcript| WS
    WS -->|JSON segments| Audio
```
```
wss://api.omi.me/v4/listen?uid={uid}&language={lang}&sample_rate={rate}&codec={codec}
```
- **`uid`**: User ID obtained from Firebase authentication. Required for all connections.
- **`language`**: Language code for transcription. Supports:
  - Standard codes: `'en'`, `'es'`, `'fr'`, `'de'`, `'ja'`, `'zh'`, etc.
  - Multi-language: `'multi'` for automatic language detection (uses Soniox)
- **`sample_rate`**: Audio sample rate in Hz. Common values: `8000`, `16000`, `44100`, `48000`
- **`codec`**: Audio codec. Supported options:
  - `pcm8` - 8-bit PCM (default)
  - `pcm16` - 16-bit PCM
  - `opus` - Opus codec (16kHz)
  - `opus_fs320` - Opus with 320 frame size
  - `aac` - AAC codec
  - `lc3` - LC3 codec
  - `lc3_fs1030` - LC3 with 1030 frame size
- **Channels**: Number of audio channels. Use `1` for mono, `2` for stereo.
- **Speech profile**: Enable speaker identification using the user's stored speech profile. When enabled, the system uses a dual-socket architecture for improved speaker detection.
- **Conversation timeout**: Seconds of silence before the conversation is automatically processed. After this timeout, the conversation is saved and LLM processing begins.
- **STT service**: Explicitly specify the STT service. Options: `deepgram`, `soniox`, `speechmatics`. If not specified, the system selects one based on language.
- **`custom_stt`**: Enable custom STT mode. When set to `'enabled'`, the backend accepts app-provided transcripts instead of using STT services. Useful for apps with their own transcription.
- **Source**: Conversation source identifier. Examples: `'omi'`, `'openglass'`, `'phone'`
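For illustration, a minimal Python client using the `websockets` package could connect and stream 16-bit PCM like this. This is a sketch: the query values, chunk pacing, and error handling are placeholders, and a real app would also send keep-alive pings as described later.

```python
import asyncio
import json

import websockets  # pip install websockets

URL = (
    "wss://api.omi.me/v4/listen"
    "?uid=YOUR_UID&language=en&sample_rate=16000&codec=pcm16"
)

async def stream(chunks):
    async with websockets.connect(URL) as ws:
        async def sender():
            for chunk in chunks:              # raw pcm16 bytes
                await ws.send(chunk)
                await asyncio.sleep(0.1)      # pace roughly in real time

        async def receiver():
            async for message in ws:
                data = json.loads(message)
                if isinstance(data, list):    # batch of transcript segments
                    for seg in data:
                        print(seg["speaker"], seg["text"])
                else:                         # typed event, e.g. service_status
                    print("event:", data.get("type"))

        await asyncio.gather(sender(), receiver())

# asyncio.run(stream(pcm16_chunks))
```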
The system supports multiple audio codecs with automatic decoding:
| Codec | Sample Rate | Description | Use Case |
|---|---|---|---|
| `pcm8` | 8kHz | 8-bit PCM | Default, low bandwidth |
| `pcm16` | 16kHz | 16-bit PCM | Better quality |
| `opus` | 16kHz | Opus encoded | Efficient compression |
| `opus_fs320` | 16kHz | Opus, 320 frame size | Alternative frame size |
| `aac` | Variable | AAC encoded | iOS compatibility |
| `lc3` | Variable | LC3 codec | Bluetooth audio |
| `lc3_fs1030` | Variable | LC3, 1030 frame size | Alternative LC3 |
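As an illustration of what decoding involves, the sketch below uses the `opuslib` bindings to turn Opus frames into 16-bit PCM. The library choice and frame size are assumptions for illustration only, not the backend's actual decoder (which lives in `backend/routers/transcribe.py`).

```python
import opuslib  # pip install opuslib (libopus bindings)

SAMPLE_RATE = 16000  # Opus streams are handled at 16kHz per the table above
CHANNELS = 1
FRAME_SIZE = 320     # samples per frame; opus_fs320 implies a 320-sample frame

decoder = opuslib.Decoder(SAMPLE_RATE, CHANNELS)

def decode_opus_frame(frame: bytes) -> bytes:
    """Decode one Opus packet into 16-bit little-endian PCM bytes."""
    return decoder.decode(frame, FRAME_SIZE)
```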
The system automatically selects the best STT provider based on language:
```mermaid
flowchart TD
    Start[Incoming Audio] --> Lang{Language?}
    Lang -->|English| DG3[Deepgram Nova-3]
    Lang -->|Multi/Auto| Soniox["Soniox<br/>95+ languages"]
    Lang -->|Unsupported by Nova-3| DG2[Deepgram Nova-2]
    DG3 -->|Fallback on error| DG2
    DG2 -->|Fallback on error| SM[Speechmatics]
```
| Provider | Languages | Model | Best For |
|---|---|---|---|
| Deepgram Nova-3 | 30+ | nova-3 | Primary English, major languages |
| Deepgram Nova-2 | 40+ | nova-2-general | Broader language support |
| Soniox | 95+ | Real-time | Multi-language, auto-detection |
| Speechmatics | 50+ | Real-time | Additional coverage |
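The routing above can be summarized as a simple selection function. This is a sketch of the decision logic, not the actual code in `backend/utils/stt/streaming.py`; the language sets shown are illustrative.

```python
# Illustrative language sets; the real lists live in the backend configuration.
NOVA3_LANGUAGES = {"en", "es", "fr", "de", "ja", "zh"}
NOVA2_LANGUAGES = NOVA3_LANGUAGES | {"uk", "th", "vi"}

def select_stt_provider(language: str, override: str | None = None) -> tuple[str, str]:
    """Return (provider, model) for a requested language."""
    if override:                      # an explicit STT service parameter wins
        return override, "default"
    if language == "multi":           # auto-detection across 95+ languages
        return "soniox", "real-time"
    if language in NOVA3_LANGUAGES:
        return "deepgram", "nova-3"
    if language in NOVA2_LANGUAGES:   # broader coverage
        return "deepgram", "nova-2-general"
    return "speechmatics", "real-time"
```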
When using Deepgram, the following options are configured:
| Option | Value | Purpose |
|---|---|---|
| `punctuate` | `true` | Automatic punctuation insertion |
| `no_delay` | `true` | Minimize latency for real-time feedback |
| `endpointing` | `300` | 300ms of silence to detect sentence boundaries |
| `interim_results` | `false` | Only return final transcripts |
| `smart_format` | `true` | Format numbers, dates, currencies |
| `profanity_filter` | `false` | Keep all words unfiltered |
| `diarize` | `true` | Enable speaker identification |
| `filler_words` | `false` | Remove "um", "uh", etc. |
| `encoding` | `linear16` | 16-bit PCM encoding |
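With the Deepgram Python SDK (v3), these options map to a `LiveOptions` configuration roughly like the following sketch; the backend's actual wiring in `backend/utils/stt/streaming.py` may differ.

```python
from deepgram import LiveOptions  # pip install deepgram-sdk

options = LiveOptions(
    model="nova-3",
    language="en",
    punctuate=True,          # automatic punctuation
    no_delay=True,           # minimize latency
    endpointing=300,         # 300ms of silence marks a sentence boundary
    interim_results=False,   # final transcripts only
    smart_format=True,       # format numbers, dates, currencies
    profanity_filter=False,  # keep all words
    diarize=True,            # speaker identification
    filler_words=False,      # drop "um", "uh", etc.
    encoding="linear16",     # 16-bit PCM
    sample_rate=16000,
    channels=1,
)
```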
Build your own transcription/diarization WebSocket service that integrates with Omi.
```mermaid
flowchart LR
    subgraph App["📱 Omi App"]
        Capture[Audio Capture]
    end
    subgraph Custom["🎧 Your STT Service"]
        WS[WebSocket Server]
    end
    subgraph Backend["🖥️ Omi Backend"]
        API["/v4/listen"]
    end
    Capture -->|Binary audio| WS
    WS -->|JSON transcripts| Capture
    Capture -->|suggested_transcript| API
```
| Message | Format | Description |
|---|---|---|
| Audio frames | Binary | Raw audio bytes (codec configured by app, typically opus 16kHz) |
{"type": "CloseStream"} | JSON | End of audio stream |
Format: JSON object with a `segments` array:
```json
{
  "segments": [
    {
      "text": "Hello, how are you?",
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 1.5
    },
    {
      "text": "I'm doing great, thanks!",
      "speaker": "SPEAKER_01",
      "start": 1.6,
      "end": 3.2
    }
  ]
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | string | Yes | Transcribed text |
| `speaker` | string | No | Speaker label (`SPEAKER_00`, `SPEAKER_01`, etc.) |
| `start` | float | No | Start time in seconds |
| `end` | float | No | End time in seconds |
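A minimal service implementing this contract could look like the sketch below, using the `websockets` package; the `transcribe` function is a placeholder you would replace with your own STT engine.

```python
import asyncio
import json

import websockets  # pip install websockets

def transcribe(audio: bytes) -> list[dict]:
    """Placeholder: run your own STT/diarization over the buffered audio."""
    return [{"text": "Hello, how are you?", "speaker": "SPEAKER_00",
             "start": 0.0, "end": 1.5}]

async def handle(ws):
    buffer = bytearray()
    async for message in ws:
        if isinstance(message, bytes):        # binary audio frame
            buffer.extend(message)
            continue
        msg = json.loads(message)
        if msg.get("type") == "CloseStream":  # end of audio stream
            await ws.send(json.dumps({"segments": transcribe(bytes(buffer))}))
            await ws.close()
            return

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

# asyncio.run(main())
```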
```mermaid
sequenceDiagram
    participant App as 📱 Omi App
    participant Backend as 🖥️ Backend
    participant Socket1 as 🎧 Deepgram Socket 1
    participant Socket2 as 🎧 Deepgram Socket 2
    Note over Backend: User has speech profile<br/>(duration: 30 seconds)
    App->>Backend: Connect WebSocket
    Backend->>Socket1: Create (for live audio)
    Backend->>Socket2: Create (for profile training)
    loop First 30 seconds
        App->>Backend: Audio chunk
        Backend->>Socket2: Send to profile socket
        Socket2-->>Backend: Transcript (training)
    end
    Note over Backend: Profile duration elapsed
    Backend->>Socket2: Close socket
    loop Remaining session
        App->>Backend: Audio chunk
        Backend->>Socket1: Send to live socket
        Socket1-->>Backend: Transcript with speaker IDs
        Backend-->>App: Segments (is_user: true/false)
    end
```
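The routing amounts to a time-based switch between the two sockets. The sketch below is a simplified illustration of the idea; the actual implementation lives in `backend/routers/transcribe.py`.

```python
import time

PROFILE_WINDOW = 30.0  # seconds; matches the speech profile duration above

class AudioRouter:
    """Route incoming chunks: profile socket first, then the live socket."""

    def __init__(self, live_socket, profile_socket):
        self.live = live_socket
        self.profile = profile_socket
        self.started = time.monotonic()

    async def route(self, chunk: bytes):
        if self.profile and time.monotonic() - self.started < PROFILE_WINDOW:
            await self.profile.send(chunk)  # trains speaker identification
        else:
            if self.profile:
                await self.profile.close()  # profile window elapsed
                self.profile = None
            await self.live.send(chunk)     # normal live transcription
```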
Raw audio bytes encoded according to the `codec` parameter. Sent continuously during recording.
```
[Binary audio chunk - varies by codec]
```
**Keep-alive:** Messages of 2 bytes or less are treated as heartbeat pings.
Assign a known person to detected speakers:
```json
{
"type": "speaker_assigned",
"speaker_id": 1,
"person_id": "person-uuid-here",
"person_name": "John",
"segment_ids": ["seg-uuid-1", "seg-uuid-2"]
}
```
When `custom_stt=enabled`, apps can provide their own transcripts:
```json
{
"type": "suggested_transcript",
"segments": [
{
"text": "Hello there",
"speaker": "SPEAKER_00",
"speaker_id": 0,
"start": 0.0,
"end": 1.5,
"is_user": true,
"person_id": "known-person-uuid-or-null"
}
],
"stt_provider": "custom-provider-name"
}
```
See [External Custom STT Service](#external-custom-stt-service) for building your own transcription service.
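For example, an app connected with `custom_stt=enabled` could forward a transcript it produced itself, along the lines of this sketch (the helper function and field values are illustrative):

```python
import json

async def send_suggested_transcript(ws, text: str, start: float, end: float):
    """Forward an app-generated transcript over the /v4/listen socket."""
    await ws.send(json.dumps({
        "type": "suggested_transcript",
        "segments": [{
            "text": text,
            "speaker": "SPEAKER_00",
            "speaker_id": 0,
            "start": start,
            "end": end,
            "is_user": True,
            "person_id": None,  # or a known person's UUID
        }],
        "stt_provider": "custom-provider-name",
    }))
```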
For OpenGlass and visual captures:
```json
{
"type": "image_chunk",
"id": "temp-image-id",
"index": 0,
"total": 3,
"data": "base64-encoded-chunk"
}
```
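Chunking works by splitting the base64 payload and numbering the pieces, roughly as in the sketch below; the chunk size is an assumption, not a documented limit.

```python
import base64
import json
import math
import uuid

CHUNK_SIZE = 8192  # assumed; pick a size that fits your WebSocket frames

def image_chunk_messages(image_bytes: bytes) -> list[str]:
    """Split an image into ordered image_chunk messages."""
    data = base64.b64encode(image_bytes).decode()
    total = math.ceil(len(data) / CHUNK_SIZE)
    image_id = str(uuid.uuid4())
    return [
        json.dumps({
            "type": "image_chunk",
            "id": image_id,
            "index": i,
            "total": total,
            "data": data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE],
        })
        for i in range(total)
    ]
```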
Real-time transcript segments as they're detected:
```json
[
{
"id": "uuid-string",
"text": "Hello there",
"speaker": "SPEAKER_00",
"speaker_id": 0,
"is_user": true,
"person_id": null,
"start": 0.0,
"end": 1.5,
"speech_profile_processed": true,
"stt_provider": "deepgram"
}
]
```
Connection and service status updates:
```json
{
"type": "service_status",
"status": "ready",
"status_text": "Service Ready"
}
```
System suggests a known person for a detected speaker:
```json
{
"type": "speaker_label_suggestion",
"speaker_id": 1,
"person_id": "person-uuid",
"person_name": "John",
"segment_id": "segment-uuid"
}
```
Sent when conversation timeout triggers processing:
```json
{
"type": "memory_created",
"memory": {
"id": "conversation-uuid",
"structured": {
"title": "Meeting Discussion",
"overview": "..."
}
},
"messages": []
}
```
When translation is enabled:
```json
{
"type": "translation",
"segments": [
{
"id": "segment-uuid",
"translations": [
{"lang": "es", "text": "Hola ahí"}
]
}
]
}
```
Each transcript segment contains:
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique UUID for the segment |
| `text` | string | Transcribed text content |
| `speaker` | string | Speaker label (`"SPEAKER_00"`, `"SPEAKER_01"`, etc.) |
| `speaker_id` | integer | Numeric speaker ID (0, 1, 2...) |
| `is_user` | boolean | `true` if spoken by the device owner |
| `person_id` | string? | UUID of the identified person (if matched) |
| `start` | float | Start time in seconds |
| `end` | float | End time in seconds |
| `speech_profile_processed` | boolean | Whether the speech profile was used for identification |
| `stt_provider` | string? | Name of the STT provider used |
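In Python terms, a segment maps to a model along these lines. This is a sketch mirroring the table above; the canonical definition is `backend/models/transcript_segment.py`.

```python
from typing import Optional

from pydantic import BaseModel

class TranscriptSegment(BaseModel):
    id: str                           # unique segment UUID
    text: str
    speaker: str                      # "SPEAKER_00", "SPEAKER_01", ...
    speaker_id: int                   # 0, 1, 2...
    is_user: bool                     # spoken by the device owner
    person_id: Optional[str] = None   # matched person's UUID, if any
    start: float                      # seconds
    end: float                        # seconds
    speech_profile_processed: bool = False
    stt_provider: Optional[str] = None
```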
```mermaid
stateDiagram-v2
    [*] --> Connecting: WebSocket request
    Connecting --> Authenticating: Connection accepted
    Authenticating --> Ready: User validated
    Authenticating --> Closed: Auth failed
    Ready --> Streaming: Audio received
    Streaming --> Streaming: More audio
    Streaming --> Processing: Silence timeout
    Processing --> Streaming: New audio
    Processing --> Closed: Session complete
    Ready --> Closed: Client disconnect
    Streaming --> Closed: Client disconnect
    note right of Processing
        Conversation saved
        LLM extracts structure
        Memories extracted
    end note
```
The system includes robust error handling:
| Error Type | Handling |
|---|---|
| STT Connection Failed | Exponential backoff retry (1s → 32s, 3 attempts) |
| Provider Error | Automatic fallback to next provider |
| Decode Error | Log and skip corrupted audio chunk |
| WebSocket Error | Clean close with appropriate code |
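The STT retry behavior can be sketched as exponential backoff with a cap. This is an illustration using the numbers from the table; the actual retry logic lives in the backend code.

```python
import asyncio
import random

async def connect_with_backoff(connect, attempts: int = 3,
                               base_delay: float = 1.0, max_delay: float = 32.0):
    """Retry an async connect() with exponentially growing, capped delays."""
    for attempt in range(attempts):
        try:
            return await connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up; caller falls back to the next provider
            delay = min(base_delay * 2 ** attempt, max_delay)
            await asyncio.sleep(delay + random.random() * 0.1)  # small jitter
```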
| Component | Path |
|---|---|
| WebSocket Handler | `backend/routers/transcribe.py` |
| Deepgram Integration | `backend/utils/stt/streaming.py` |
| Soniox Integration | `backend/utils/stt/streaming.py` |
| Audio Decoding | `backend/routers/transcribe.py` |
| Speech Profile | `backend/utils/stt/speech_profile.py` |
| VAD (Voice Activity) | `backend/utils/stt/vad.py` |
| Transcript Model | `backend/models/transcript_segment.py` |