docs/mqtt-udp.md
This document describes the MQTT + UDP hybrid protocol used between the device and the server, based on the current implementation: MQTT carries control messages, UDP carries real-time audio.
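The protocol uses two channels:
- MQTT: control messages (hello handshake, listen, abort, MCP, goodbye).
- UDP: encrypted Opus audio in both directions.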
sequenceDiagram
participant Device as ESP32 device
participant MQTT as MQTT broker
participant UDP as UDP server
Note over Device, UDP: 1. Establish MQTT connection
Device->>MQTT: MQTT Connect
MQTT->>Device: Connected
Note over Device, UDP: 2. Request audio channel
Device->>MQTT: Hello message (type: "hello", transport: "udp")
MQTT->>Device: Hello response (UDP endpoint + encryption keys)
Note over Device, UDP: 3. Establish UDP connection
Device->>UDP: UDP Connect
UDP->>Device: Connected
Note over Device, UDP: 4. Audio streaming
loop Audio stream
Device->>UDP: Encrypted audio (Opus)
UDP->>Device: Encrypted audio (Opus)
end
Note over Device, UDP: 5. Control messages
par Control
Device->>MQTT: Listen / TTS / MCP messages
MQTT->>Device: STT / TTS / MCP / Alert responses
end
Note over Device, UDP: 6. Teardown
Device->>MQTT: Goodbye
Device->>UDP: Disconnect
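The device connects to the broker using the connection settings read from storage (broker endpoint, client ID, user name, password, and keep-alive interval). Once connected, it requests the audio channel by publishing a hello message: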
{
"type": "hello",
"version": 3,
"transport": "udp",
"features": {
"mcp": true,
"aec": true
},
"audio_params": {
"format": "opus",
"sample_rate": 16000,
"channels": 1,
"frame_duration": 60
}
}
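To make the shape of the hello message concrete, here is a minimal sketch that assembles it as a string. `BuildHelloMessage` and its `server_aec` parameter are hypothetical names, not taken from the firmware; `server_aec` stands in for the `CONFIG_USE_SERVER_AEC` build option. The sketch assumes `features.aec` is omitted rather than set to `false` when the option is off, and real code would likely use a JSON library instead of string concatenation.

```cpp
#include <string>

// Hypothetical helper: builds the device hello message shown above.
// server_aec mirrors CONFIG_USE_SERVER_AEC (assumption: the key is omitted when off).
std::string BuildHelloMessage(bool server_aec, int frame_duration_ms = 60) {
    std::string msg = "{\"type\":\"hello\",\"version\":3,\"transport\":\"udp\",";
    msg += "\"features\":{\"mcp\":true";  // mcp is always advertised
    if (server_aec) {
        msg += ",\"aec\":true";
    }
    msg += "},\"audio_params\":{\"format\":\"opus\",\"sample_rate\":16000,";
    msg += "\"channels\":1,\"frame_duration\":";
    msg += std::to_string(frame_duration_ms);
    msg += "}}";
    return msg;
}
```

The returned string would then be published on the device's configured publish topic.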
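features.mcp is always set; features.aec is set when CONFIG_USE_SERVER_AEC is enabled.
The server answers with a hello response that carries the session ID, the server's audio parameters, and the UDP endpoint together with its encryption material: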
{
"type": "hello",
"transport": "udp",
"session_id": "xxx",
"audio_params": {
"format": "opus",
"sample_rate": 24000,
"channels": 1,
"frame_duration": 60
},
"udp": {
"server": "192.168.1.100",
"port": 8888,
"key": "0123456789ABCDEF0123456789ABCDEF",
"nonce": "0123456789ABCDEF0123456789ABCDEF"
}
}
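The key and nonce in the response above arrive as hex strings, so the device has to turn them into raw bytes before using them for AES. The sketch below shows one way to do that; `DecodeHex` is a hypothetical helper, not the firmware's actual function. The 32-character example values decode to 16 bytes each, which matches the AES-128 key size.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Converts one hex digit to its value, or returns -1 if it is not a hex digit.
static int HexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decodes a hex string such as udp.key into raw bytes.
// Returns false for odd-length input or non-hex characters.
bool DecodeHex(const std::string& hex, std::vector<uint8_t>& out) {
    if (hex.size() % 2 != 0) return false;
    std::vector<uint8_t> bytes;
    bytes.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        int hi = HexValue(hex[i]);
        int lo = HexValue(hex[i + 1]);
        if (hi < 0 || lo < 0) return false;
        bytes.push_back(static_cast<uint8_t>((hi << 4) | lo));
    }
    out = bytes;  // only overwrite the output on success
    return true;
}
```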
Field reference:
- udp.server - UDP server address.
- udp.port - UDP server port.
- udp.key - AES key, hex-encoded.
- udp.nonce - AES nonce, hex-encoded.
Listen
{
"session_id": "xxx",
"type": "listen",
"state": "start",
"mode": "manual"
}
Abort
{
"session_id": "xxx",
"type": "abort",
"reason": "wake_word_detected"
}
MCP
{
"session_id": "xxx",
"type": "mcp",
"payload": {
"jsonrpc": "2.0",
"id": 1,
"result": {}
}
}
Goodbye
{
"session_id": "xxx",
"type": "goodbye"
}
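Semantics match the WebSocket protocol. Supported types include:
- TTS state messages (start, stop, sentence_start).
- System commands, for example "command": "reboot".
- Alerts, with the fields status, message, and emotion.
- Custom messages (require CONFIG_RECEIVE_CUSTOM_MESSAGE).
Example alert: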
{
"session_id": "xxx",
"type": "alert",
"status": "Warning",
"message": "Battery low",
"emotion": "sad"
}
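After the device receives the MQTT hello response, it connects to the UDP server at udp.server and udp.port and sets up AES-CTR with udp.key and udp.nonce. Each audio packet has the following layout: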
|type 1B|flags 1B|payload_len 2B|ssrc 4B|timestamp 4B|sequence 4B|
|payload payload_len bytes|
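The layout above amounts to a fixed 16-byte header followed by the payload. The following sketch packs and parses that header; the struct and function names are illustrative, not the firmware's. It assumes every multi-byte field is big-endian, including ssrc, for which the document does not state a byte order.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative struct; field names follow the packet layout.
struct AudioPacketHeader {
    uint8_t type;
    uint8_t flags;
    uint16_t payload_len;
    uint32_t ssrc;
    uint32_t timestamp;
    uint32_t sequence;
};

constexpr std::size_t kHeaderSize = 16;  // 1 + 1 + 2 + 4 + 4 + 4 bytes

static void PutU32(uint8_t* p, uint32_t v) {
    p[0] = static_cast<uint8_t>(v >> 24);
    p[1] = static_cast<uint8_t>(v >> 16);
    p[2] = static_cast<uint8_t>(v >> 8);
    p[3] = static_cast<uint8_t>(v);
}

static uint32_t GetU32(const uint8_t* p) {
    return (static_cast<uint32_t>(p[0]) << 24) | (static_cast<uint32_t>(p[1]) << 16) |
           (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
}

// Serializes the header in network byte order.
std::array<uint8_t, kHeaderSize> PackHeader(const AudioPacketHeader& h) {
    std::array<uint8_t, kHeaderSize> buf{};
    buf[0] = h.type;
    buf[1] = h.flags;
    buf[2] = static_cast<uint8_t>(h.payload_len >> 8);
    buf[3] = static_cast<uint8_t>(h.payload_len & 0xFF);
    PutU32(&buf[4], h.ssrc);
    PutU32(&buf[8], h.timestamp);
    PutU32(&buf[12], h.sequence);
    return buf;
}

// Parses a received datagram. Returns false if it is too short, has an
// unexpected type, or declares more payload than the datagram contains.
bool ParseHeader(const uint8_t* data, std::size_t len, AudioPacketHeader& out) {
    if (len < kHeaderSize) return false;
    out.type = data[0];
    out.flags = data[1];
    out.payload_len = static_cast<uint16_t>((data[2] << 8) | data[3]);
    out.ssrc = GetU32(data + 4);
    out.timestamp = GetU32(data + 8);
    out.sequence = GetU32(data + 12);
    if (out.type != 0x01) return false;
    return len - kHeaderSize >= out.payload_len;
}
```

Writing the bytes explicitly, rather than casting a struct pointer, avoids any dependence on the host's endianness or struct padding.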
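Field reference:
- type: packet type, always 0x01.
- flags: flags, currently unused.
- payload_len: payload length (network byte order).
- ssrc: synchronization source identifier.
- timestamp: timestamp (network byte order).
- sequence: sequence number (network byte order).
- payload: encrypted Opus audio data.
Payloads are encrypted with AES-CTR, using the key and nonce from the hello response (udp.key and udp.nonce).
Sequence numbers are tracked in both directions:
- local_sequence_ is incremented monotonically for outgoing packets.
- remote_sequence_ is used to validate the continuity of incoming packets.
stateDiagram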
direction TB
[*] --> Disconnected
Disconnected --> MqttConnecting: StartMqttClient()
MqttConnecting --> MqttConnected: MQTT Connected
MqttConnecting --> Disconnected: Connect failed
MqttConnected --> RequestingChannel: OpenAudioChannel()
RequestingChannel --> ChannelOpened: Hello exchange success
RequestingChannel --> MqttConnected: Hello timeout / failed
ChannelOpened --> UdpConnected: UDP connect success
UdpConnected --> AudioStreaming: Start audio
AudioStreaming --> UdpConnected: Stop audio
UdpConnected --> ChannelOpened: UDP disconnect
ChannelOpened --> MqttConnected: CloseAudioChannel()
MqttConnected --> Disconnected: MQTT disconnect
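The transitions in the diagram can also be written as a small transition function, which is handy for testing. This is a sketch only: the `State` and `Event` enums and `NextState` are hypothetical names that mirror the diagram labels, and it assumes that an event not drawn in the diagram leaves the state unchanged.

```cpp
enum class State {
    Disconnected, MqttConnecting, MqttConnected, RequestingChannel,
    ChannelOpened, UdpConnected, AudioStreaming
};

enum class Event {
    StartMqttClient, MqttConnected, ConnectFailed, OpenAudioChannel,
    HelloSuccess, HelloFailed, UdpConnected, StartAudio, StopAudio,
    UdpDisconnected, CloseAudioChannel, MqttDisconnected
};

// Returns the next state for (s, e). Events that have no arrow in the
// diagram are ignored and the current state is returned (assumption).
State NextState(State s, Event e) {
    switch (s) {
        case State::Disconnected:
            if (e == Event::StartMqttClient) return State::MqttConnecting;
            break;
        case State::MqttConnecting:
            if (e == Event::MqttConnected) return State::MqttConnected;
            if (e == Event::ConnectFailed) return State::Disconnected;
            break;
        case State::MqttConnected:
            if (e == Event::OpenAudioChannel) return State::RequestingChannel;
            if (e == Event::MqttDisconnected) return State::Disconnected;
            break;
        case State::RequestingChannel:
            if (e == Event::HelloSuccess) return State::ChannelOpened;
            if (e == Event::HelloFailed) return State::MqttConnected;
            break;
        case State::ChannelOpened:
            if (e == Event::UdpConnected) return State::UdpConnected;
            if (e == Event::CloseAudioChannel) return State::MqttConnected;
            break;
        case State::UdpConnected:
            if (e == Event::StartAudio) return State::AudioStreaming;
            if (e == Event::UdpDisconnected) return State::ChannelOpened;
            break;
        case State::AudioStreaming:
            if (e == Event::StopAudio) return State::UdpConnected;
            break;
    }
    return s;
}
```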
The device determines whether the audio channel is available with:
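bool IsAudioChannelOpened() const {
    // Open only if the UDP connection exists, no error was flagged,
    // and the channel has not timed out.
    return udp_ != nullptr && !error_occurred_ && !IsTimeout();
}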
The following MQTT settings are read from storage:
- endpoint - broker address.
- client_id - client identifier.
- username - user name.
- password - password.
- keepalive - keep-alive interval (default 240 s).
- publish_topic - publish topic.
The base Protocol class provides timeout detection through IsTimeout(), which IsAudioChannelOpened() checks.
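One plausible way to implement such timeout detection is to record when the last packet arrived and compare that against a limit. The sketch below is an assumption about the general technique, not the base class's actual code: `TimeoutTracker`, `OnIncoming`, and the 120-second value in the usage note are all hypothetical.

```cpp
#include <chrono>

// Hypothetical tracker: the channel counts as timed out when nothing
// has been received for longer than the configured limit.
class TimeoutTracker {
public:
    using Clock = std::chrono::steady_clock;

    explicit TimeoutTracker(std::chrono::seconds timeout)
        : timeout_(timeout), last_incoming_(Clock::now()) {}

    // Call whenever a message or audio packet arrives.
    void OnIncoming(Clock::time_point now = Clock::now()) { last_incoming_ = now; }

    // True if the time since the last incoming data exceeds the limit.
    bool IsTimeout(Clock::time_point now = Clock::now()) const {
        return now - last_incoming_ > timeout_;
    }

private:
    std::chrono::seconds timeout_;
    Clock::time_point last_incoming_;
};
```

For example, `TimeoutTracker tracker(std::chrono::seconds(120));` followed by `tracker.OnIncoming()` in the receive path would report a timeout after two silent minutes. A monotonic clock is used so that wall-clock adjustments cannot trigger or mask a timeout.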
A mutex protects the UDP connection:
std::lock_guard<std::mutex> lock(channel_mutex_);
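To show where that lock would sit, here is a minimal sketch in which every access to the UDP object goes through `channel_mutex_`. `AudioChannel` and `FakeUdp` are hypothetical stand-ins; the firmware's real UDP class and method names are not given in this document.

```cpp
#include <memory>
#include <mutex>
#include <string>

// Stand-in for the real UDP transport (hypothetical).
struct FakeUdp {
    int sent = 0;
    void Send(const std::string&) { ++sent; }
};

class AudioChannel {
public:
    void Open() {
        std::lock_guard<std::mutex> lock(channel_mutex_);
        udp_.reset(new FakeUdp());
    }

    void Close() {
        std::lock_guard<std::mutex> lock(channel_mutex_);
        udp_.reset();
    }

    // Returns false if the channel is closed. The lock keeps Close() on
    // another thread from destroying udp_ in the middle of a send.
    bool Send(const std::string& packet) {
        std::lock_guard<std::mutex> lock(channel_mutex_);
        if (!udp_) return false;
        udp_->Send(packet);
        return true;
    }

private:
    std::mutex channel_mutex_;
    std::unique_ptr<FakeUdp> udp_;
};
```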
| Feature | MQTT + UDP | WebSocket |
|---|---|---|
| Control channel | MQTT | WebSocket |
| Audio channel | UDP (encrypted) | WebSocket (binary) |
| Latency | Low (UDP) | Medium |
| Reliability | Medium | High |
| Complexity | High | Low |
| Encryption | AES-CTR | TLS |
| Firewall friendliness | Low | High |
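The MQTT + UDP hybrid protocol achieves efficient audio communication by splitting the work: MQTT carries control messages, UDP carries AES-CTR-encrypted Opus audio with low latency, and sequence numbers check packet continuity.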
The protocol is a good fit for low-latency voice interaction, at the cost of higher network complexity than pure WebSocket.