docs/content/features/openai-realtime.md
LocalAI supports the OpenAI Realtime API, which enables low-latency, multimodal conversations (voice and text) over WebSocket.
To use the Realtime API, you need to configure a pipeline model that defines the components for Voice Activity Detection (VAD), Transcription (STT), Language Model (LLM), and Text-to-Speech (TTS).
Create a model configuration file (e.g., gpt-realtime.yaml) in your models directory. For a complete reference of configuration options, see [Model Configuration]({{%relref "advanced/model-configuration" %}}).
name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1
This configuration links the following components:
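- `vad`: the Voice Activity Detection model (`silero-vad-ggml`) to detect when the user is speaking.
- `transcription`: the speech-to-text model (`whisper-large-turbo`) to transcribe user audio.
- `llm`: the language model (`qwen3-4b`) to generate responses.
- `tts`: the text-to-speech model (`tts-1`) to synthesize the audio response.

Make sure all referenced models (`silero-vad-ggml`, `whisper-large-turbo`, `qwen3-4b`, `tts-1`) are also installed or defined in your LocalAI instance.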
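As a quick pre-flight check, the sketch below reports pipeline components whose models are missing. It is only an illustration, not part of LocalAI: `missing_pipeline_models` is a hypothetical helper, the pipeline is passed as a plain dict mirroring the YAML above, and the set of installed model names is assumed to have been gathered separately.

```python
# Hypothetical helper, not part of LocalAI: checks a pipeline mapping
# (mirroring the YAML above) against a set of installed model names.
REQUIRED_COMPONENTS = ("vad", "transcription", "llm", "tts")


def missing_pipeline_models(pipeline, installed):
    """Return {component: problem} for each component that is unset or not installed."""
    problems = {}
    for component in REQUIRED_COMPONENTS:
        model = pipeline.get(component)
        if not model:
            problems[component] = "not configured"
        elif model not in installed:
            problems[component] = f"model {model!r} is not installed"
    return problems


pipeline = {
    "vad": "silero-vad-ggml",
    "transcription": "whisper-large-turbo",
    "llm": "qwen3-4b",
    "tts": "tts-1",
}
# Assumed example state: tts-1 has not been installed yet.
installed = {"silero-vad-ggml", "whisper-large-turbo", "qwen3-4b"}
print(missing_pipeline_models(pipeline, installed))
```

An empty result means every component names a model that is present in the supplied set.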
By default each stage runs to completion before the next begins: the whole utterance is transcribed, the full LLM reply is generated, then it is synthesized. Each stage can instead be streamed incrementally, which lowers the time-to-first-audio of a turn:
name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1
  streaming:
    llm: true              # stream LLM tokens as transcript deltas
    tts: true              # emit audio deltas per synthesized chunk
    transcription: true    # stream transcript text deltas of the user's speech
    clause_chunking: true  # synthesize each clause as soon as it completes
- `streaming.tts`: emits one `response.output_audio.delta` per audio chunk the TTS backend produces (requires a backend that supports streaming synthesis), instead of one delta for the whole utterance. Falls back to a single unary delta otherwise.
- `streaming.transcription`: emits `conversation.item.input_audio_transcription.delta` events as the transcript is produced (requires a transcription backend that supports streaming).
- `streaming.llm`: streams LLM tokens as `response.output_audio_transcript.delta` events. The full reply is buffered and synthesized once it is complete: as streamed audio chunks when `streaming.tts` is enabled (and the TTS backend supports it), otherwise as a single unary delta. Reasoning/thinking is always stripped from the spoken transcript. Tool calls are supported while streaming when the LLM uses its tokenizer template (`use_tokenizer_template: true`): the backend's autoparser then delivers content and tool calls separately, so the spoken transcript never leaks tool-call tokens. Grammar-based function calling keeps the buffered path.
- `streaming.clause_chunking`: synthesizes each clause as soon as it completes instead of waiting for the full reply. Clause boundaries include sentence-ending punctuation (。!? with no whitespace), CJK clause punctuation (,、;:), and Thai/Lao spaces. It does not rely on whitespace sentence boundaries, so it works for languages such as Chinese, Japanese and Thai where the old per-sentence approach degraded to whole-message buffering. Requires `streaming.llm`; scripts that genuinely need a dictionary (e.g. Khmer, Burmese) simply stay buffered until a space or end-of-message. Off by default.

All streaming flags are off by default, so existing pipelines are unaffected.
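To make the idea of clause chunking concrete, here is a minimal sketch of an incremental splitter. It is not LocalAI's implementation: `ClauseChunker` is a hypothetical class, the rule that ASCII punctuation only counts when followed by whitespace is an assumption made for this example, and Thai/Lao space handling is omitted.

```python
# Illustrative only; LocalAI's actual clause chunker may behave differently.
# Full-width CJK marks: ideographic full stop, !, ?, comma, enumeration comma, ;, :
CJK_BOUNDARIES = set("\u3002\uff01\uff1f\uff0c\u3001\uff1b\uff1a")
ASCII_BOUNDARIES = set(".!?,;:")


class ClauseChunker:
    """Buffers streamed text and releases clauses as soon as they complete."""

    def __init__(self):
        self._buffer = ""

    def feed(self, delta):
        """Append a streamed text delta and return the clauses completed so far."""
        self._buffer += delta
        clauses = []
        start = 0
        for i, ch in enumerate(self._buffer):
            if ch in CJK_BOUNDARIES:
                boundary = True  # no whitespace needed after CJK punctuation
            elif ch in ASCII_BOUNDARIES:
                # Assumed rule: only split when whitespace follows, so "3.14"
                # stays intact. If the next character has not arrived, wait.
                boundary = self._buffer[i + 1 : i + 2].isspace()
            else:
                boundary = False
            if boundary:
                clause = self._buffer[start : i + 1].strip()
                if clause:
                    clauses.append(clause)
                start = i + 1
        self._buffer = self._buffer[start:]
        return clauses

    def flush(self):
        """Return whatever is left at end-of-message."""
        rest, self._buffer = self._buffer.strip(), ""
        return [rest] if rest else []


chunker = ClauseChunker()
for delta in ["Hello, wor", "ld. How are", " you?"]:
    for clause in chunker.feed(delta):
        print("synthesize:", clause)
for clause in chunker.flush():
    print("synthesize:", clause)
```

Each completed clause could be handed to TTS immediately, which is what lowers time-to-first-audio compared with waiting for the whole reply.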
For reasoning models, you can force the pipeline LLM's thinking off without editing the LLM model config:
pipeline:
  llm: qwen3-4b
  disable_thinking: true  # maps to enable_thinking=false for the realtime LLM
This is applied only to the realtime session's copy of the LLM config, so it does not affect other users of the same model. Leave it unset to use the LLM model config's own reasoning settings.
The Realtime API supports two transports: WebSocket and WebRTC.
Connect to the WebSocket endpoint:
ws://localhost:8080/v1/realtime?model=gpt-realtime
Audio is sent and received as raw PCM in the WebSocket messages, following the OpenAI Realtime API protocol.
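The following sketch shows how a client might build the endpoint URL and wrap audio for sending. It assumes LocalAI accepts the OpenAI Realtime `input_audio_buffer.append` event, in which the audio bytes are base64-encoded inside a JSON message; the helper names and the 16-bit little-endian sample format are assumptions for illustration.

```python
import base64
import json
import struct
from urllib.parse import urlencode


def realtime_ws_url(host, model, port=8080):
    """Build the WebSocket URL for a given pipeline model."""
    return f"ws://{host}:{port}/v1/realtime?{urlencode({'model': model})}"


def audio_append_event(pcm_bytes):
    """Wrap raw PCM bytes in an input_audio_buffer.append client event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


# Four samples of 16-bit little-endian silence, standing in for microphone audio.
silence = struct.pack("<4h", 0, 0, 0, 0)
print(realtime_ws_url("localhost", "gpt-realtime"))
print(audio_append_event(silence))
```

The resulting JSON string would be sent as a text message over whichever WebSocket client library you use.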
The WebRTC transport enables browser-based voice conversations with lower latency. Connect by POSTing an SDP offer to the REST endpoint:
POST http://localhost:8080/v1/realtime?model=gpt-realtime
Content-Type: application/sdp
<SDP offer body>
The response contains the SDP answer to complete the WebRTC handshake.
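Outside a browser, the same exchange can be sketched with Python's standard library. The helper names below are hypothetical, and the offer string is a placeholder: a real offer comes from your WebRTC stack (for example, a browser's `RTCPeerConnection`).

```python
import urllib.request


def build_sdp_offer_request(offer_sdp, model, base_url="http://localhost:8080"):
    """Build (but do not send) the POST request carrying the SDP offer."""
    return urllib.request.Request(
        f"{base_url}/v1/realtime?model={model}",
        data=offer_sdp.encode("utf-8"),
        headers={"Content-Type": "application/sdp"},
        method="POST",
    )


def exchange_sdp(request, timeout=10):
    """Send the offer and return the SDP answer from the response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder offer, used here only to show the shape of the request.
request = build_sdp_offer_request("v=0\r\n", "gpt-realtime")
print(request.get_method(), request.full_url)
```

Calling `exchange_sdp(request)` against a running instance would return the answer, which is then applied as the remote description to finish the handshake.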
WebRTC uses the Opus audio codec for encoding and decoding audio on RTP tracks. The opus backend must be installed for WebRTC to work. Install it from the model gallery:
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "opus"}'
Or set the EXTERNAL_GRPC_BACKENDS environment variable if running a local build:
EXTERNAL_GRPC_BACKENDS=opus:/path/to/backend/go/opus/opus
The opus backend is loaded automatically when a WebRTC session starts. It does not require a model configuration file, only the backend binary.
By default pion gathers a host ICE candidate for every local interface. Under
Docker host networking that includes bridge addresses (docker0/veth,
172.x) that a remote browser cannot route to: the call typically connects on a
good candidate and then drops a few seconds later when ICE consent checks fail on
the unreachable ones. Two settings let you advertise only the reachable address:
# Advertise these IPs as the host ICE candidates (e.g. the host's LAN IP)
LOCALAI_WEBRTC_NAT_1TO1_IPS=192.168.1.10
# ...or restrict ICE gathering to specific interfaces
LOCALAI_WEBRTC_ICE_INTERFACES=eth0
{{% notice tip %}}
For a browser on another LAN machine talking to LocalAI in a host-networked
container, set LOCALAI_WEBRTC_NAT_1TO1_IPS to the host's LAN IP. This is the
most reliable fix for WebRTC connections that establish and then drop.
{{% /notice %}}
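If you are unsure which address to advertise, a small script can narrow down the candidates. This is a rough sketch under stated assumptions: `reachable_host_ips` is a hypothetical helper, the address list is supplied by hand, and Docker bridge networks are assumed to fall within `172.16.0.0/12`. If your real LAN also uses that range, adjust the filter.

```python
import ipaddress

# Assumption: Docker bridge networks are allocated from 172.16.0.0/12.
DOCKER_BRIDGE_RANGE = ipaddress.ip_network("172.16.0.0/12")


def reachable_host_ips(addresses):
    """Keep addresses that are neither loopback nor in the assumed Docker range."""
    keep = []
    for raw in addresses:
        ip = ipaddress.ip_address(raw)
        if ip.is_loopback or ip in DOCKER_BRIDGE_RANGE:
            continue
        keep.append(raw)
    return keep


# Example addresses as they might appear on a Docker host (made up for illustration).
local_addresses = ["127.0.0.1", "172.17.0.1", "192.168.1.10"]
candidates = reachable_host_ips(local_addresses)
if candidates:
    print(f"LOCALAI_WEBRTC_NAT_1TO1_IPS={candidates[0]}")
```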
The API follows the OpenAI Realtime API protocol for handling sessions, audio buffers, and conversation items.
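As an illustration of consuming server events, the sketch below accumulates transcript text and audio from a list of JSON messages. It assumes the event shapes of the OpenAI Realtime protocol, where each delta event carries a `delta` field (text for transcripts, base64 for audio); `collect_response` is a hypothetical helper and the sample messages are hand-written.

```python
import base64
import json


def collect_response(messages):
    """Accumulate transcript text and audio bytes from server event JSON strings."""
    transcript = []
    audio = bytearray()
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.output_audio_transcript.delta":
            transcript.append(event["delta"])
        elif kind == "response.output_audio.delta":
            audio.extend(base64.b64decode(event["delta"]))
        # Other event types are ignored in this sketch.
    return "".join(transcript), bytes(audio)


# Hand-written sample events standing in for messages read from the socket.
messages = [
    json.dumps({"type": "response.output_audio_transcript.delta", "delta": "Hel"}),
    json.dumps({
        "type": "response.output_audio.delta",
        "delta": base64.b64encode(b"\x01\x02").decode("ascii"),
    }),
    json.dumps({"type": "response.output_audio_transcript.delta", "delta": "lo"}),
    json.dumps({"type": "response.done"}),
]
text, audio = collect_response(messages)
print(text, len(audio))
```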