docs/doc/developer/backend/backend_deepdive.mdx
Omi is a multimodal AI assistant designed to understand and interact with users in a way that's both intelligent and human-centered. The backend plays a crucial role in this by:

- Converting streamed audio into text in real time
- Using LLMs to turn raw transcripts into structured conversations, action items, events, and memories
- Persisting data across Firestore, Pinecone, Redis, and Google Cloud Storage
- Powering the retrieval and reasoning behind Omi's chat system

This deep dive walks you through the core elements of Omi's backend, providing a clear roadmap for developers and enthusiasts alike to understand its inner workings.
<Tabs>
  <Tab title="Quick Start" icon="rocket">
    Jump to the [Quick Reference](#quick-reference) table to find what you need fast.
  </Tab>
  <Tab title="Full Deep Dive" icon="book">
    Read through the complete documentation below to understand the architecture.
  </Tab>
  <Tab title="Visual Guides" icon="diagram-project">
    See the [System Architecture](#system-architecture) flowchart and other diagrams throughout.
  </Tab>
</Tabs>

## System Architecture

```mermaid
flowchart TD
    subgraph Client
        App["🔵 Omi App<br/>(Flutter)"]
    end
    subgraph API["API Endpoints"]
        WS["/v4/listen<br/>WebSocket"]
        Conv["/v1/conversations<br/>REST"]
        Chat["/v2/messages<br/>REST"]
    end
    subgraph Routers
        Trans["routers/transcribe.py"]
        ConvR["routers/conversations.py"]
        ChatR["routers/chat.py"]
    end
    subgraph Processing
        DG["Deepgram API<br/>Speech-to-Text"]
        PC["process_conversation()"]
        LG["LangGraph<br/>Agentic System"]
    end
    subgraph Storage["Data Storage"]
        FS[("Firestore<br/>Conversations & Memories")]
        Pine[("Pinecone<br/>Vector Embeddings")]
        Redis[("Redis<br/>Cache & Metadata")]
        GCS[("Google Cloud Storage<br/>Binary Files")]
    end
    App --> WS
    App --> Conv
    App --> Chat
    WS --> Trans
    Conv --> ConvR
    Chat --> ChatR
    Trans --> DG
    ConvR --> PC
    ChatR --> LG
    PC --> FS
    PC --> Pine
    PC --> GCS
    LG --> FS
    LG --> Pine
    LG --> Redis
    Trans --> GCS
```
## Quick Reference

| Need to... | Go to |
|---|---|
| Store a conversation | `database/conversations.py` |
| Query similar conversations | `database/vector_db.py` |
| Process LLM calls | `utils/llm/` directory |
| Handle real-time audio | `routers/transcribe.py` |
| Manage caching | `database/redis_db.py` |
| Understand chat system | Chat System Architecture |
| Learn data models | Storing Conversations |
Let's trace the journey of a typical interaction with Omi, focusing on how audio recordings are transformed into stored conversations:
<Steps>
  <Step title="User Initiates Recording" icon="microphone">
    The user starts a recording session using the Omi app, capturing a conversation or their thoughts.
  </Step>
  <Step title="WebSocket Connection" icon="plug">
    The Omi app establishes a real-time connection with the backend at the `/v4/listen` endpoint in `routers/transcribe.py`.
  </Step>
  <Step title="Audio Streaming" icon="wave-pulse">
    The app streams audio data continuously through the WebSocket to the backend.
  </Step>
  <Step title="Deepgram Processing" icon="ear-listen">
    The backend forwards audio to the Deepgram API for real-time speech-to-text conversion.
  </Step>
  <Step title="Live Feedback" icon="comment-dots">
    Transcription results stream back through the backend to the app, displaying words as the user speaks.
  </Step>
  <Step title="Conversation Creation" icon="floppy-disk">
    During the WebSocket connection, the backend creates an "in_progress" conversation stub in Firestore. As audio streams, transcript segments are continuously added to Firestore in real time. When recording ends, the app sends a POST request to `/v1/conversations` with an empty body (`{}`). The backend retrieves the in-progress conversation from Firestore and processes it.
  </Step>
  <Step title="LLM Processing" icon="wand-magic-sparkles">
    The `process_conversation` function uses OpenAI to extract:
    - **Title & Overview** - Summarizes the conversation
    - **Action Items** - Tasks and to-dos mentioned
    - **Events** - Calendar-worthy moments
    - **Memories** - Facts about the user
  </Step>
  <Step title="Storage" icon="database">
    - **Firestore**: Stores the full conversation document with transcript segments and metadata
    - **Pinecone**: Stores the vector embedding for semantic search
    - **Redis**: Caches frequently accessed data (speech profile durations, enabled apps, user names) for performance
    - **Google Cloud Storage**: Stores binary files (speech profile audio, conversation recordings, photos)
  </Step>
</Steps>

The extracted `Structured` data includes the following fields:

| Field | Description |
|---|---|
| `title` | A short, descriptive title |
| `overview` | Concise summary of main points |
| `category` | Work, personal, etc. |
| `action_items` | Tasks or to-dos mentioned |
| `events` | Calendar-worthy events |
| `memories` | Facts about the user (stored separately) |
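
To make the "Conversation Creation" step concrete, here's a hedged sketch of the finalize call from a client's point of view. The base URL and auth token are placeholders, not real values from the app:

```python
import requests

BASE_URL = 'https://backend.example.com'       # placeholder backend URL
HEADERS = {'Authorization': 'Bearer <token>'}  # placeholder auth token

# An empty JSON body: the backend already holds the in-progress conversation
# (with its streamed transcript segments) and just needs the signal to process it.
response = requests.post(f'{BASE_URL}/v1/conversations', json={}, headers=HEADERS)
conversation = response.json()

# The processed document carries the LLM-extracted Structured fields
print(conversation['structured']['title'])
```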
Now that you understand the general flow, let's dive deeper into the key modules and services that power Omi's backend.
### `database/conversations.py`: The Conversation Guardian

This module is responsible for managing the interaction with Firebase Firestore, Omi's main database for storing conversations and related data.
**Key Functions:**

- `upsert_conversation`: Creates or updates a conversation document in Firestore, ensuring efficient storage and handling of updates.
- `get_conversation`: Retrieves a specific conversation by its ID.
- `get_conversations`: Fetches a list of conversations for a user, allowing for filtering, pagination, and optional inclusion of discarded conversations.

**Firestore Structure:**
Each conversation is stored as a document in Firestore with the following fields:
```python
from datetime import datetime
from typing import Dict, List, Optional

from pydantic import BaseModel

# Enums and nested models (ConversationSource, ConversationStatus, Structured,
# TranscriptSegment, Geolocation, ConversationPhoto, AppResult) are defined
# elsewhere in the backend's models.

class Conversation(BaseModel):
    id: str                               # Unique ID
    created_at: datetime                  # Creation timestamp
    started_at: Optional[datetime]
    finished_at: Optional[datetime]
    source: Optional[ConversationSource]  # omi, phone, desktop, openglass, etc.
    language: Optional[str]
    status: ConversationStatus            # in_progress, processing, completed, failed
    structured: Structured                # Extracted title, overview, action items, etc.
    transcript_segments: List[TranscriptSegment]
    geolocation: Optional[Geolocation]
    photos: List[ConversationPhoto]
    apps_results: List[AppResult]
    external_data: Optional[Dict]
    discarded: bool
    deleted: bool
    visibility: str                       # private, shared, public

    # See StoringConversations.mdx for the complete field reference
```
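
For orientation, here is a minimal sketch of what `upsert_conversation` plausibly does with this model. The `users/{uid}/conversations/{id}` collection layout is an assumption for illustration; the actual paths live in `database/conversations.py`:

```python
from google.cloud import firestore

db = firestore.Client()

def upsert_conversation(uid: str, conversation: Conversation):
    # Using the conversation ID as the document ID makes set() a create-or-update
    doc_ref = (
        db.collection('users')
        .document(uid)
        .collection('conversations')
        .document(conversation.id)
    )
    doc_ref.set(conversation.dict())
```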
A conversation moves through the following statuses as it is recorded and processed:

```mermaid
stateDiagram-v2
    [*] --> in_progress: Recording starts (WebSocket connects)
    in_progress --> processing: Recording ends (POST /v1/conversations or timeout)
    processing --> completed: Successfully processed
    processing --> failed: Error occurred
    completed --> [*]
    failed --> processing: Retry
    note right of processing
        LLM extracts structure
        Memories extracted
        Embeddings generated
    end note
```
**Processing Triggers:**

- The app sends `POST /v1/conversations` with an empty body to trigger immediate processing.
- Processing also starts automatically after a period of inactivity (the `conversation_timeout` parameter, default 120 seconds of silence).
- Either trigger runs the `process_conversation()` function to extract structure, memories, and embeddings.

### `database/vector_db.py`: The Embedding Expert

This module manages the interaction with Pinecone, a vector database used to store and query conversation embeddings.
**Key Functions:**

- `upsert_vector`: Adds or updates a conversation embedding in Pinecone.
- `upsert_vectors`: Efficiently adds or updates multiple embeddings.
- `query_vectors`: Performs similarity search to find conversations relevant to a user query.
- `delete_vector`: Removes a conversation embedding.

**Pinecone's Role:**
Pinecone's specialized vector search capabilities are essential for semantic retrieval: when a user asks Omi a question, the chat system embeds the query and uses similarity search to surface the most relevant past conversations, something keyword matching alone can't do.
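
A minimal sketch of how `query_vectors` might run that similarity search with the Pinecone client. The index name and the per-user metadata filter are assumptions for illustration:

```python
from pinecone import Pinecone

pc = Pinecone(api_key='<pinecone-api-key>')  # placeholder key
index = pc.Index('conversations')            # hypothetical index name

def query_vectors(uid: str, query_embedding: list[float], k: int = 5) -> list[str]:
    # Scope the similarity search to the requesting user's conversations
    result = index.query(
        vector=query_embedding,
        top_k=k,
        filter={'uid': {'$eq': uid}},  # assumed metadata field
    )
    return [match.id for match in result.matches]
```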
### `utils/llm/` Directory: The AI Maestro

This directory contains the modules where the power of OpenAI's LLMs is harnessed for a wide range of tasks. It's the core of Omi's intelligence!
**Key Files:**

- `clients.py`: LLM client configurations and embedding models
- `chat.py`: Chat-related prompts and processing
- `conversation_processing.py`: Conversation analysis and structuring

**Key Functionalities:**
**OpenAI Integration:**

- `text-embedding-3-large` is used to generate vector embeddings for conversations and user queries.

**Why this is Essential:**

Embeddings turn free-form conversations into vectors that Pinecone can search, which is what lets the chat system ground its answers in a user's history, while the prompts in this directory produce the structured titles, overviews, action items, and memories described earlier.
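
Generating such an embedding with the OpenAI Python client is compact. A minimal sketch follows; the helper name `embed_text` is illustrative, not an actual function from `clients.py`:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_text(text: str) -> list[float]:
    # text-embedding-3-large returns a 3072-dimensional vector by default
    response = client.embeddings.create(model='text-embedding-3-large', input=text)
    return response.data[0].embedding
```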
### `utils/other/storage.py`: The Cloud Storage Manager

This module handles interactions with Google Cloud Storage (GCS), specifically for managing user speech profiles.
**Key Functions:**

- `upload_profile_audio(file_path: str, uid: str)`:
  - Uploads the recording to the bucket named by the `BUCKET_SPEECH_PROFILES` environment variable.
  - Stores it under a path derived from the user's ID (`uid`).
- `get_profile_audio_if_exists(uid: str) -> str`:
  - Returns the URL of the user's stored speech profile audio, or `None` if the profile does not exist.
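
A hedged sketch of the upload path using the `google-cloud-storage` client; the object naming scheme here is an assumption:

```python
import os

from google.cloud import storage

def upload_profile_audio(file_path: str, uid: str) -> str:
    # Bucket name comes from the BUCKET_SPEECH_PROFILES environment variable
    bucket = storage.Client().bucket(os.environ['BUCKET_SPEECH_PROFILES'])
    blob = bucket.blob(f'{uid}/speech_profile.wav')  # hypothetical object path
    blob.upload_from_filename(file_path)
    return blob.public_url
```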
**Usage:**

- The `upload_profile_audio` function is called when a user uploads a new speech profile recording through the `/v3/upload-audio` endpoint (defined in `routers/speech_profile.py`).
- The `get_profile_audio_if_exists` function is used to retrieve a user's speech profile when needed, for example, during speaker identification in real-time transcription or post-processing.

### `database/redis_db.py`: The Data Speedster

Redis is an in-memory data store known for its speed and efficiency. The `database/redis_db.py` module handles Omi's interactions with Redis, which is primarily used for caching, managing user
settings, and storing speech profile metadata (the audio itself lives in Google Cloud Storage).
**Data Stored and Retrieved from Redis:**

- Speech profile durations and Soniox speech profile flags
- App enable/disable states and app reviews
- Cached user names

**Key Functions:**
| Function | Purpose |
|---|---|
| `set_speech_profile_duration` | Cache speech profile duration (audio stored in GCS) |
| `get_speech_profile_duration` | Retrieve cached speech profile duration |
| `set_user_has_soniox_speech_profile` | Mark that user has a Soniox speech profile |
| `get_user_has_soniox_speech_profile` | Check if user has a Soniox speech profile |
| `enable_app`, `disable_app` | Manage app enable/disable states |
| `get_enabled_apps` | Get user's enabled apps |
| `get_app_reviews` | Retrieve reviews for an app |
| `cache_user_name`, `get_cached_user_name` | Cache and retrieve user names |
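
As a sketch of the caching pattern these functions follow (the key scheme and TTL below are illustrative, not the backend's actual values):

```python
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def cache_user_name(uid: str, name: str, ttl: int = 60 * 60 * 24):
    # Expiring keys keep cached names from going permanently stale
    r.set(f'users:{uid}:name', name, ex=ttl)

def get_cached_user_name(uid: str) -> str | None:
    return r.get(f'users:{uid}:name')  # None on cache miss
```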
**Why Redis is Important:**

Serving hot data (enabled apps, cached user names, speech profile durations) from memory avoids repeated Firestore reads on latency-sensitive paths like real-time transcription and chat.
### `routers/transcribe.py`: The Real-Time Transcription Engine

This module is the powerhouse behind Omi's real-time transcription capabilities, allowing the app to convert spoken audio into text as the user is speaking. It leverages WebSockets for bidirectional communication with the Omi app and multiple STT services (Deepgram, Soniox, Speechmatics) for accurate and efficient transcription.
**The `/v4/listen` Endpoint:** The Omi app initiates a WebSocket connection with the backend at the `/v4/listen` endpoint, which is defined in the `websocket_endpoint` function of `routers/transcribe.py`.

Omi supports multiple Speech-to-Text (STT) services with automatic selection based on language support:

- Deepgram
- Soniox
- Speechmatics
The backend automatically selects the appropriate STT service using `get_stt_service_for_language()` in `utils/stt/streaming.py`, which considers language support and the configured service priority order (set via the `STT_SERVICE_MODELS` environment variable).
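
A hypothetical sketch of that selection logic, assuming the environment variable holds a comma-separated priority list; the language sets below are invented for illustration, and the real lists live in `utils/stt/streaming.py`:

```python
import os

SERVICE_LANGUAGES = {  # invented supported-language sets for illustration
    'soniox': {'en', 'es', 'fr'},
    'deepgram': {'en', 'es', 'fr', 'de', 'hi'},
    'speechmatics': {'en', 'de'},
}

def get_stt_service_for_language(language: str) -> str | None:
    # Priority order is configured via the STT_SERVICE_MODELS environment variable
    priority = os.getenv('STT_SERVICE_MODELS', 'deepgram,soniox,speechmatics').split(',')
    for service in priority:
        if language in SERVICE_LANGUAGES.get(service, set()):
            return service
    return None
```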
**Processing Functions:**

- `process_audio_dg()` - Manages interaction with the Deepgram API (found in `utils/stt/streaming.py`)
- `process_audio_soniox()` - Manages interaction with the Soniox API
- `process_audio_speechmatics()` - Manages interaction with the Speechmatics API

The audio chunks streamed from the Omi app are sent to the selected STT service's API for transcription. The service's speech recognition models process the audio and return text results in real time.
**Deepgram Options Configuration:** The `process_audio_dg` function configures various Deepgram streaming options, including the transcription language and the latency-focused `no_delay` flag mentioned below.
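
A hedged sketch of what such an options payload can look like, built from documented Deepgram streaming parameters; the actual values in `utils/stt/streaming.py` may differ:

```python
# Illustrative Deepgram streaming options; production values may differ.
deepgram_options = {
    'language': 'en',          # from the WebSocket query parameters
    'punctuate': True,         # add punctuation and capitalization
    'no_delay': True,          # emit results immediately for low latency
    'interim_results': False,  # only forward finalized segments (assumed)
    'diarize': True,           # label speakers for speaker identification
    'encoding': 'linear16',    # raw PCM from the app (assumed)
    'sample_rate': 16000,      # assumed sample rate
}
```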
```mermaid
sequenceDiagram
    participant App as 📱 Omi App
    participant Backend as 🖥️ Backend
    participant STT as 🎧 STT Service (Deepgram/Soniox/Speechmatics)
    App->>Backend: Connect WebSocket /v4/listen
    Backend-->>App: Connection accepted
    loop Real-time Transcription
        App->>Backend: Audio chunk (binary)
        Backend->>STT: Forward audio
        STT-->>Backend: Transcription result
        Backend-->>App: Text segment + speaker
        Note right of App: Display words<br/>as user speaks
    end
    App->>Backend: Close connection
    Backend-->>App: Session ended
```
Breaking this flow down step by step:
1. The Omi app streams audio over a WebSocket connection to the `/v4/listen` endpoint.
2. The `websocket_endpoint` function receives the audio chunks and selects the appropriate STT service (Deepgram, Soniox, or Speechmatics) based on the language using `get_stt_service_for_language()`.
3. The chunks are forwarded to the matching processing function (`process_audio_dg`, `process_audio_soniox`, or `process_audio_speechmatics`).
4. Transcription results stream back to the app as they arrive; the `no_delay` option in Deepgram and the efficient handling of data in the backend are essential for minimizing delays.

A simplified view of the endpoint:

```python
from fastapi import APIRouter, WebSocket

# ... other imports ...

router = APIRouter()

@router.websocket("/v4/listen")
async def websocket_endpoint(websocket: WebSocket, uid: str, language: str = 'en', ...):
    await websocket.accept()  # Accept the WebSocket connection

    # Start STT transcription (service selected automatically based on language)
    transcript_socket = await process_audio_dg(uid, websocket, language, ...)
    # Note: could also be process_audio_soniox() or process_audio_speechmatics(),
    # depending on language support

    # Receive and forward audio chunks from the app
    async for data in websocket.iter_bytes():
        transcript_socket.send(data)

    # ... other logic for speaker identification, error handling, etc.
```
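
For the other side of this exchange, here's a minimal, hypothetical client sketch using the `websockets` library. The host, audio file, and chunk size are placeholders, and the real app also authenticates:

```python
import asyncio

import websockets

async def stream_audio(uid: str, path: str):
    # Hypothetical host; uid and language ride along as query parameters
    uri = f'wss://backend.example.com/v4/listen?uid={uid}&language=en'
    async with websockets.connect(uri) as ws:
        with open(path, 'rb') as audio:
            while chunk := audio.read(4096):
                await ws.send(chunk)  # binary frames, as in the sequence diagram

asyncio.run(stream_audio('user-123', 'recording.raw'))
```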
For more detailed information on specific subsystems:
```mermaid
mindmap
  root((Backend Deep Dive))
    Conversations
      StoringConversations.mdx
      Data Models
      Firestore Structure
    Chat System
      chat_system.mdx
      LangGraph
      Agentic Tools
    Transcription
      transcription.mdx
      Deepgram
      WebSockets
    Setup
      Backend_Setup.mdx
      Environment
      Dependencies
```