docs/websocket.md
This document describes the WebSocket communication protocol between the device and the server, based on the current code. When implementing a server, please cross-check with the actual implementation.
Device initialization
Application:
WebsocketProtocol) that implements the Protocol interface.
Opening the WebSocket connection
OpenAudioChannel():
Sets the request headers (Authorization, Protocol-Version, Device-Id, Client-Id).
Calls Connect() to establish the WebSocket connection.
Device sends a "hello" message
{
  "type": "hello",
  "version": 1,
  "features": {
    "mcp": true,
    "aec": true
  },
  "transport": "websocket",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000,
    "channels": 1,
    "frame_duration": 60
  }
}
features is optional and generated from compile-time configuration. For example, "mcp": true means the device supports MCP, and "aec": true is emitted when CONFIG_USE_SERVER_AEC is enabled.
frame_duration matches OPUS_FRAME_DURATION_MS (typically 60 ms).
Server replies with "hello"
The server replies with a JSON message whose "type" is "hello" and whose "transport" is "websocket".
The reply may carry a session_id; the device will store it.
{
  "type": "hello",
  "transport": "websocket",
  "session_id": "xxx",
  "audio_params": {
    "format": "opus",
    "sample_rate": 24000,
    "channels": 1,
    "frame_duration": 60
  }
}
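As a rough server-side illustration of the reply above, the sketch below builds a hello response from an already-parsed device hello. The helper name `build_server_hello`, the `downlink_sample_rate` parameter, and the use of a random hex string as the session ID are assumptions made for the example; the protocol text only requires "type": "hello" and "transport": "websocket".

```python
import json
import uuid


def build_server_hello(device_hello: dict, downlink_sample_rate: int = 24000) -> str:
    """Return the JSON text of a server hello for a given device hello.

    Hypothetical helper: raises ValueError when the device hello is not
    something this sketch knows how to answer.
    """
    if device_hello.get("type") != "hello":
        raise ValueError("expected a hello message")
    if device_hello.get("transport") != "websocket":
        raise ValueError("unsupported transport")

    device_audio = device_hello.get("audio_params", {})
    reply = {
        "type": "hello",
        "transport": "websocket",
        # The session ID format is an assumption; any unique string would do.
        "session_id": uuid.uuid4().hex,
        "audio_params": {
            "format": "opus",
            "sample_rate": downlink_sample_rate,
            "channels": 1,
            # Mirror the device's frame duration, defaulting to 60 ms.
            "frame_duration": device_audio.get("frame_duration", 60),
        },
    }
    return json.dumps(reply)
```

The returned string would be sent as a WebSocket text frame; how the frame is actually written depends on the server library in use.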
If transport matches, the device marks the audio channel as opened.
Subsequent exchanges
Two kinds of data are sent in either direction:
In the code, the receive callback splits traffic as follows:
OnData(...):
When binary is true, the payload is treated as an Opus frame and decoded.
When binary is false, the payload is parsed as JSON and dispatched by type.
When the server or network drops, OnDisconnected() fires:
The device calls on_audio_channel_closed_() and eventually returns to the idle state.
Closing the WebSocket connection
The device calls CloseAudioChannel() to tear down the socket and returns to idle.
When establishing the WebSocket connection, the device sets the following headers:
Authorization: access token, usually formatted as "Bearer <token>".
Protocol-Version: the protocol version number, matching the version field in the hello message.
Device-Id: the physical MAC address of the device.
Client-Id: a software-generated UUID (reset when NVS is erased or the full firmware is re-flashed).
These headers are sent with the WebSocket handshake; the server can use them for authentication or bookkeeping.
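To make the header handling concrete, here is a minimal sketch of how a server might pull these four headers out of the handshake request. The `HandshakeInfo` type and `parse_handshake_headers` function are hypothetical names, and falling back to protocol version 1 when the header is missing or malformed is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class HandshakeInfo:
    token: Optional[str]
    protocol_version: int
    device_id: Optional[str]
    client_id: Optional[str]


def parse_handshake_headers(headers: Mapping[str, str]) -> HandshakeInfo:
    """Extract the device headers from the handshake request headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Accept only the "Bearer <token>" form described above.
    scheme, _, credential = lowered.get("authorization", "").partition(" ")
    credential = credential.strip()
    token = credential if scheme.lower() == "bearer" and credential else None

    # Assumption: treat a missing or non-numeric version as version 1.
    try:
        version = int(lowered.get("protocol-version", "1"))
    except ValueError:
        version = 1

    return HandshakeInfo(token, version, lowered.get("device-id"), lowered.get("client-id"))
```

Validating the token itself (signature, expiry, lookup) is left out here because the document does not specify the token format.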
The device supports several binary protocol versions, selected by the version field in settings:
Version 1: raw Opus frames with no extra metadata. The WebSocket layer already distinguishes text and binary frames.
Version 2 uses the BinaryProtocol2 structure:
struct BinaryProtocol2 {
    uint16_t version;        // protocol version
    uint16_t type;           // message type (0: OPUS, 1: JSON)
    uint32_t reserved;       // reserved
    uint32_t timestamp;      // timestamp in milliseconds (useful for server-side AEC)
    uint32_t payload_size;   // payload size in bytes
    uint8_t payload[];       // payload
} __attribute__((packed));
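A server written in Python could read and write this layout with the standard `struct` module. The sketch below is illustrative only: the function names are hypothetical, and it assumes the multi-byte fields are big-endian (network byte order) and that the version field carries the value 2, neither of which is stated by the struct definition itself, so both should be checked against the firmware.

```python
import struct

# Assumption: big-endian fields. Order: version, type, reserved,
# timestamp, payload_size (2 + 2 + 4 + 4 + 4 = 16 bytes).
HEADER_V2 = struct.Struct(">HHIII")

TYPE_OPUS = 0
TYPE_JSON = 1


def pack_v2(payload: bytes, timestamp_ms: int, msg_type: int = TYPE_OPUS) -> bytes:
    """Prefix a payload with a BinaryProtocol2-style header."""
    # The timestamp field is 32 bits wide, so wrap larger values.
    header = HEADER_V2.pack(2, msg_type, 0, timestamp_ms & 0xFFFFFFFF, len(payload))
    return header + payload


def unpack_v2(frame: bytes):
    """Return (version, type, timestamp_ms, payload) from one binary frame."""
    if len(frame) < HEADER_V2.size:
        raise ValueError("frame shorter than the 16-byte header")
    version, msg_type, _reserved, timestamp_ms, size = HEADER_V2.unpack_from(frame)
    payload = frame[HEADER_V2.size:HEADER_V2.size + size]
    if len(payload) != size:
        raise ValueError("payload shorter than payload_size")
    return version, msg_type, timestamp_ms, payload
```

Checking `payload_size` against the actual frame length guards against truncated frames before the payload reaches the Opus decoder.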
Version 3 uses the BinaryProtocol3 structure:
struct BinaryProtocol3 {
    uint8_t type;            // message type
    uint8_t reserved;        // reserved
    uint16_t payload_size;   // payload size
    uint8_t payload[];       // payload
} __attribute__((packed));
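The more compact four-byte header can be handled the same way. As before, this is a sketch under assumptions: `pack_v3` and `unpack_v3` are hypothetical names, `payload_size` is assumed to be big-endian, and type 0 is assumed to mean Opus by analogy with BinaryProtocol2.

```python
import struct

# Assumption: big-endian payload_size. Order: type, reserved, payload_size.
HEADER_V3 = struct.Struct(">BBH")


def pack_v3(payload: bytes, msg_type: int = 0) -> bytes:
    """Prefix a payload with a BinaryProtocol3-style header."""
    if len(payload) > 0xFFFF:
        # payload_size is a 16-bit field, so larger payloads cannot be framed.
        raise ValueError("payload too large for a 16-bit size field")
    return HEADER_V3.pack(msg_type, 0, len(payload)) + payload


def unpack_v3(frame: bytes):
    """Return (type, payload) from one binary frame."""
    if len(frame) < HEADER_V3.size:
        raise ValueError("frame shorter than the 4-byte header")
    msg_type, _reserved, size = HEADER_V3.unpack_from(frame)
    payload = frame[HEADER_V3.size:HEADER_V3.size + size]
    if len(payload) != size:
        raise ValueError("payload shorter than payload_size")
    return msg_type, payload
```

Unlike the BinaryProtocol2 layout, this header has no timestamp, so a server relying on timestamps for echo cancellation would need the other format.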
WebSocket text frames carry JSON. The most common "type" values and their semantics are listed below. Fields that are not listed may be implementation-specific or optional.
Hello
{
  "type": "hello",
  "version": 1,
  "features": {
    "mcp": true,
    "aec": true
  },
  "transport": "websocket",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000,
    "channels": 1,
    "frame_duration": 60
  }
}
Listen
"session_id": session identifier.
"type": "listen"
"state": "start", "stop", or "detect" (wake word detected).
"mode": "auto", "manual", or "realtime".
{
  "session_id": "xxx",
  "type": "listen",
  "state": "start",
  "mode": "manual"
}
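On the server side, the listen fields listed above can be checked before acting on them. The following is a minimal sketch; `parse_listen` is a hypothetical helper, and treating "mode" as optional is an assumption based on the wake-word variant of this message, which carries "text" instead.

```python
LISTEN_STATES = {"start", "stop", "detect"}
LISTEN_MODES = {"auto", "manual", "realtime"}


def parse_listen(message: dict):
    """Validate a listen message and return (state, mode, text)."""
    if message.get("type") != "listen":
        raise ValueError("not a listen message")

    state = message.get("state")
    if state not in LISTEN_STATES:
        raise ValueError(f"unknown listen state: {state!r}")

    # Assumption: mode may be absent, but if present it must be a known value.
    mode = message.get("mode")
    if mode is not None and mode not in LISTEN_MODES:
        raise ValueError(f"unknown listen mode: {mode!r}")

    # "text" is only expected alongside state "detect" (the wake word).
    return state, mode, message.get("text")
```

A server would typically start feeding incoming Opus frames to its recognizer on "start" and flush it on "stop".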
Abort
{
  "session_id": "xxx",
  "type": "abort",
  "reason": "wake_word_detected"
}
reason may be "wake_word_detected" or other implementation-defined values.
Wake Word Detected
{
  "session_id": "xxx",
  "type": "listen",
  "state": "detect",
  "text": "Hi XiaoZhi"
}
MCP
type: "mcp" messages whose payload is JSON-RPC 2.0 (see MCP protocol document).
{
  "session_id": "xxx",
  "type": "mcp",
  "payload": {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
      "content": [
        { "type": "text", "text": "true" }
      ],
      "isError": false
    }
  }
}
Hello
"type": "hello" and "transport": "websocket".
audio_params, meaning the audio parameters the server expects / the canonical set agreed with the device.
session_id which the device records.
STT
{"session_id": "xxx", "type": "stt", "text": "..."}
LLM
{"session_id": "xxx", "type": "llm", "emotion": "happy", "text": "😀"}
TTS
{"session_id": "xxx", "type": "tts", "state": "start"}: the server is about to stream TTS audio. The device transitions to the speaking state.
{"session_id": "xxx", "type": "tts", "state": "stop"}: the TTS segment is finished.
{"session_id": "xxx", "type": "tts", "state": "sentence_start", "text": "..."}: show the current sentence on the UI (for example, subtitle display).
MCP
payload structure follows JSON-RPC 2.0.
tools/call example:
{
  "session_id": "xxx",
  "type": "mcp",
  "payload": {
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
      "name": "self.light.set_rgb",
      "arguments": { "r": 255, "g": 0, "b": 0 }
    },
    "id": 1
  }
}
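A server usually needs to pair each tools/call request with the device's later reply by JSON-RPC id. The sketch below shows one way to do that; `build_tools_call` and `extract_tool_text` are hypothetical helpers, and the module-level counter is just one simple way to generate ids. The reply shape follows the device-to-server MCP result example in this document.

```python
import itertools
import json

# Simple monotonically increasing JSON-RPC ids (an illustrative choice).
_request_ids = itertools.count(1)


def build_tools_call(session_id: str, tool_name: str, arguments: dict):
    """Return (request_id, json_text) for a tools/call message."""
    request_id = next(_request_ids)
    message = {
        "session_id": session_id,
        "type": "mcp",
        "payload": {
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {"name": tool_name, "arguments": arguments},
            "id": request_id,
        },
    }
    return request_id, json.dumps(message)


def extract_tool_text(message: dict, expected_id: int):
    """Return the text items of a matching, successful result, else None."""
    payload = message.get("payload") or {}
    if message.get("type") != "mcp" or payload.get("id") != expected_id:
        return None
    result = payload.get("result")
    if not isinstance(result, dict) or result.get("isError"):
        return None
    return [item["text"] for item in result.get("content", [])
            if item.get("type") == "text"]
```

Returning None for both mismatched ids and error results keeps the sketch short; a real server would likely distinguish the two cases.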
System
{
  "session_id": "xxx",
  "type": "system",
  "command": "reboot"
}
"reboot": reboot the device.
Alert
Application::OnIncomingJson.
{
  "session_id": "xxx",
  "type": "alert",
  "status": "Warning",
  "message": "Battery low",
  "emotion": "sad"
}
status: short title displayed on screen.
message: detailed message.
emotion: emotion shown while alerting (e.g. "sad", "neutral").
Custom (optional)
CONFIG_RECEIVE_CUSTOM_MESSAGE is enabled.
{
  "session_id": "xxx",
  "type": "custom",
  "payload": {
    "message": "anything you want"
  }
}
Binary audio frames
listening state are dropped to avoid conflicts with the microphone stream.
Device uploads microphone audio
Device plays server audio
The device state machine is defined in main/device_state.h and includes:
kDeviceStateUnknown
kDeviceStateStarting
kDeviceStateWifiConfiguring
kDeviceStateIdle
kDeviceStateConnecting
kDeviceStateListening
kDeviceStateSpeaking
kDeviceStateUpgrading
kDeviceStateActivating
kDeviceStateAudioTesting (factory / bring-up audio testing)
kDeviceStateFatalError (non-recoverable error requiring user action)
Idle -> Connecting
OpenAudioChannel(), sets up the WebSocket, and sends "type":"hello".
Connecting -> Listening
SendStartListening(...) is called and microphone streaming begins.
Listening -> Speaking
{"type":"tts","state":"start"}; the device stops sending mic audio and plays incoming TTS.
Speaking -> Idle
{"type":"tts","state":"stop"}. When auto-continue is enabled the device transitions back to Listening; otherwise it returns to Idle.
Listening / Speaking -> Idle (abort)
SendAbortSpeaking(...) or CloseAudioChannel() interrupts the session and closes the WebSocket.
stateDiagram
direction TB
[*] --> kDeviceStateUnknown
kDeviceStateUnknown --> kDeviceStateStarting: Initialize
kDeviceStateStarting --> kDeviceStateWifiConfiguring: Configure WiFi
kDeviceStateStarting --> kDeviceStateActivating: Activate device
kDeviceStateActivating --> kDeviceStateUpgrading: New firmware detected
kDeviceStateActivating --> kDeviceStateIdle: Activation complete
kDeviceStateIdle --> kDeviceStateConnecting: Start connecting
kDeviceStateConnecting --> kDeviceStateIdle: Connection failed
kDeviceStateConnecting --> kDeviceStateListening: Connection succeeded
kDeviceStateListening --> kDeviceStateSpeaking: TTS start
kDeviceStateSpeaking --> kDeviceStateListening: TTS stop
kDeviceStateListening --> kDeviceStateIdle: Manual abort
kDeviceStateSpeaking --> kDeviceStateIdle: Auto stop
kDeviceStateStarting --> kDeviceStateAudioTesting: Factory audio test
kDeviceStateStarting --> kDeviceStateFatalError: Fatal error
stateDiagram
direction TB
[*] --> kDeviceStateUnknown
kDeviceStateUnknown --> kDeviceStateStarting: Initialize
kDeviceStateStarting --> kDeviceStateWifiConfiguring: Configure WiFi
kDeviceStateStarting --> kDeviceStateActivating: Activate device
kDeviceStateActivating --> kDeviceStateUpgrading: New firmware detected
kDeviceStateActivating --> kDeviceStateIdle: Activation complete
kDeviceStateIdle --> kDeviceStateConnecting: Start connecting
kDeviceStateConnecting --> kDeviceStateIdle: Connection failed
kDeviceStateConnecting --> kDeviceStateListening: Connection succeeded
kDeviceStateIdle --> kDeviceStateListening: Start listening
kDeviceStateListening --> kDeviceStateIdle: Stop listening
kDeviceStateIdle --> kDeviceStateSpeaking: Start speaking
kDeviceStateSpeaking --> kDeviceStateIdle: Stop speaking
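A server that wants to mirror the device's session state can encode the diagram edges as a lookup table. The sketch below covers only the Idle, Connecting, Listening and Speaking states and takes the union of the edges between them in the two diagrams above; the short lowercase names and the `SessionTracker` class are hypothetical stand-ins, not identifiers from the firmware.

```python
# Hypothetical short names standing in for the kDeviceState* values.
ALLOWED_TRANSITIONS = {
    "idle": {"connecting", "listening", "speaking"},
    "connecting": {"idle", "listening"},
    "listening": {"speaking", "idle"},
    "speaking": {"listening", "idle"},
}


class SessionTracker:
    """Track a mirrored device state and refuse edges not in the table."""

    def __init__(self) -> None:
        self.state = "idle"

    def advance(self, target: str) -> bool:
        """Move to target if the edge is allowed; return whether it was."""
        if target in ALLOWED_TRANSITIONS.get(self.state, set()):
            self.state = target
            return True
        return False
```

Rejecting unexpected edges, rather than silently following them, makes protocol mismatches between server and device easier to spot during debugging.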
Connection failure
If Connect(url) fails or the server hello is not received before the timeout, on_network_error_() is invoked and the device shows a "cannot connect" alert.
Server disconnect
OnDisconnected() is called:
on_audio_channel_closed_() runs.
Authentication
Authorization: Bearer <token>; the server must validate it.
Session scope
session_id, useful when the server serves multiple concurrent interactions.
Audio payload
OPUS_FRAME_DURATION_MS (typically 60 ms). The server may use 24 kHz on the downlink for better music playback.
Binary protocol version selection
version setting:
Protocol-Version header and the hello message.
IoT control via MCP
type: "mcp"). The legacy type: "iot" protocol is deprecated.
Malformed JSON
If type is missing, the device logs ESP_LOGE(TAG, "Missing message type, data: %s", data); and ignores the message.
A simplified two-way exchange:
Device -> Server (handshake)
{
  "type": "hello",
  "version": 1,
  "features": {
    "mcp": true,
    "aec": true
  },
  "transport": "websocket",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000,
    "channels": 1,
    "frame_duration": 60
  }
}
Server -> Device (handshake ack)
{
  "type": "hello",
  "transport": "websocket",
  "session_id": "xxx",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000
  }
}
Device -> Server (start listening)
{
  "session_id": "xxx",
  "type": "listen",
  "state": "start",
  "mode": "auto"
}
The device begins streaming binary Opus frames.
Server -> Device (ASR result)
{
  "session_id": "xxx",
  "type": "stt",
  "text": "what the user said"
}
Server -> Device (TTS start)
{
  "session_id": "xxx",
  "type": "tts",
  "state": "start"
}
The server follows up with binary Opus frames for the device to play.
Server -> Device (TTS stop)
{
  "session_id": "xxx",
  "type": "tts",
  "state": "stop"
}
The device stops playback and, if no further instructions arrive, returns to idle.
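The exchange above implies a dispatch-by-type loop on both ends. As a server-side counterpart to the device's receive callback, the sketch below routes incoming text frames to handlers keyed by "type" and ignores frames without one, mirroring the device's documented handling of a missing type. The `dispatch_text_frame` name, the handler mapping, and the logger name are assumptions for illustration.

```python
import json
import logging
from typing import Callable, Mapping

logger = logging.getLogger("ws_protocol_sketch")


def dispatch_text_frame(text: str, handlers: Mapping[str, Callable[[dict], None]]) -> bool:
    """Parse one text frame and call the handler for its type.

    Returns True when a handler ran, False when the frame was ignored.
    """
    try:
        message = json.loads(text)
    except json.JSONDecodeError:
        logger.error("Invalid JSON, data: %s", text)
        return False

    # A frame must be a JSON object with a string "type" to be routed.
    msg_type = message.get("type") if isinstance(message, dict) else None
    if not isinstance(msg_type, str):
        logger.error("Missing message type, data: %s", text)
        return False

    handler = handlers.get(msg_type)
    if handler is None:
        logger.warning("Unhandled message type: %s", msg_type)
        return False

    handler(message)
    return True
```

For example, a server could register handlers for "hello", "listen", "abort", and "mcp", and pass every non-binary frame through this function.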
This protocol carries JSON text and binary Opus frames over a WebSocket connection to implement audio streaming, TTS playback, speech recognition, device state management, MCP dispatch, and more. Key traits:
"type":"hello" and wait for the server reply.
"type" (TTS, STT, MCP, WakeWord, System, Alert, Custom, ...).
Server and device must agree on the meaning, timing, and error handling of each message type so the session runs smoothly. The text above provides the baseline for integration, debugging, and extension.