examples/voice-agent/README.md
A Spring AI Alibaba voice agent implementing the Sandwich architecture (STT → Agent → TTS).
User Input (text or audio) → STT (optional) → Sandwich Agent → TTS → Audio Output
The three stages:

- STT: realtime speech recognition, streamed as `stt_chunk` / `stt_output` events, e.g. `paraformer-realtime-v2`
- Agent: sandwich-ordering agent with tools (`add_to_order`, `confirm_order`)
- TTS: streaming speech synthesis, emitted as `tts_chunk` events, e.g. `cosyvoice-v3-flash`

Native S2S: a single model processes the conversation directly, audio input → audio output, with no intermediate text.
User Audio → [Single Model] → Audio Output
| Vendor | Model / Product | Notes |
|---|---|---|
| OpenAI | gpt-realtime (Realtime API) | Single model handles audio in and out; low latency; supports tool calling |
| Google | Gemini 2.5 Flash + Gemini Live API | Native audio input/output; 30+ HD voices, 24 languages |
| ElevenLabs | Conversational AI 2.0 | End-to-end voice conversation; <100ms latency; 32+ languages |
Characteristics: lower latency and better preservation of voice timbre and emotion, but fewer vendors to choose from.
Composed S2S, the Sandwich architecture this example adopts: three components chained in series.
User Audio → STT → LLM Agent → TTS → Audio Output
| Vendor | Capability | Notes |
|---|---|---|
| Alibaba Cloud | Intelligent Speech Interaction (ISI) | Realtime ASR + Tongyi LLM + speech synthesis; you wire the pieces together yourself |
| iFlytek | Spark speech model | ASR + speech synthesis; voice interaction platform |
| Baidu | SpeechT5 and others | Unified speech model supporting ASR, TTS, and translation |
| Microsoft | Azure Speech + Copilot | Speech Services + a conversation model; requires composing them yourself |
Characteristics: swappable components and high flexibility, but slightly higher latency and output quality that depends on the TTS.
| Type | Latency | Voice / Emotion | Vendor choices |
|---|---|---|---|
| Native S2S | Lower (single model) | Better | Fewer |
| Composed S2S | Slightly higher (multi-step chain) | Depends on TTS | More |
DashScope provides realtime ASR (e.g. `paraformer-realtime-v2`, with streaming audio input) and streaming TTS (e.g. `cosyvoice-v3-flash`, with streaming text input). This example chains them in the composed style and supports end-to-end streaming.
Set your DashScope API key (`AI_DASHSCOPE_API_KEY`):

```bash
export AI_DASHSCOPE_API_KEY=your-api-key
```

Then run the example from the repository root:

```bash
./mvnw -pl examples/voice-agent spring-boot:run
```
Start the app and open http://localhost:8080 in a browser for the real-time voice interaction UI.
The agent exposes two tools:

- `add_to_order(item, quantity)` – Add an item to the order
- `confirm_order(order_summary)` – Confirm the final order

Available: lettuce, tomato, onion, pickles, mayo, mustard; turkey, ham, roast beef; swiss, cheddar, provolone.
Connect to `ws://localhost:8080/voice/ws/audio`:
{"type":"end"} when donestt_chunk, stt_output, then agent + TTS eventsUses DashScope realtime ASR (e.g. paraformer-realtime-v2) and streaming TTS (e.g. cosyvoice-v3-flash).
Connect to `ws://localhost:8080/voice/ws` and send JSON:
{"text": "I would like a turkey sandwich with lettuce and mayo"}
For TTS-only (no agent), send:
{"text": "One turkey sandwich", "ttsOnly": true}
Server streams JSON events:
| Event | Fields | Description |
|---|---|---|
| `stt_chunk` | `transcript`, `ts` | Partial STT result (realtime audio) |
| `stt_output` | `transcript`, `ts` | Final STT result |
| `agent_chunk` | `text`, `ts` | Agent response text chunk |
| `tool_call` | `id`, `name`, `args`, `ts` | Tool invocation |
| `tool_result` | `toolCallId`, `name`, `result`, `ts` | Tool execution result |
| `agent_end` | `ts` | Agent finished |
| `tts_chunk` | `audio` (base64), `ts` | Synthesized audio chunk |
| `error` | `message` | Error message |
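For a sense of the flow, one ordering turn might emit a sequence like the following; every value here is invented purely for illustration:

```json
{"type":"stt_output","transcript":"One turkey sandwich with lettuce","ts":1700000000000}
{"type":"agent_chunk","text":"Sure, adding that now.","ts":1700000000050}
{"type":"tool_call","id":"call_1","name":"add_to_order","args":{"item":"turkey sandwich with lettuce","quantity":1},"ts":1700000000100}
{"type":"tool_result","toolCallId":"call_1","name":"add_to_order","result":"added","ts":1700000000150}
{"type":"agent_end","ts":1700000000200}
{"type":"tts_chunk","audio":"<base64>","ts":1700000000250}
```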
Example (Node.js; assumes a runtime with a global `WebSocket`, such as Node.js 22+ or a browser):
```js
const ws = new WebSocket('ws://localhost:8080/voice/ws');
ws.onopen = () => ws.send(JSON.stringify({text: "One turkey sandwich"}));
ws.onmessage = (e) => {
  const ev = JSON.parse(e.data);
  if (ev.type === 'agent_chunk') process.stdout.write(ev.text);
  if (ev.type === 'tts_chunk') { /* decode base64, play audio */ }
};
```
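For the TTS-only mode, a hedged sketch that decodes the base64 `tts_chunk` payloads and appends them to a file. It assumes the decoded chunks concatenate into one playable stream; the output name `reply.audio` is a placeholder, not part of this example:

```js
import { createWriteStream } from 'node:fs';

const ws = new WebSocket('ws://localhost:8080/voice/ws');
const out = createWriteStream('reply.audio'); // placeholder output path

ws.onopen = () => ws.send(JSON.stringify({ text: 'One turkey sandwich', ttsOnly: true }));

ws.onmessage = (e) => {
  const ev = JSON.parse(e.data);
  if (ev.type === 'tts_chunk') out.write(Buffer.from(ev.audio, 'base64')); // decode base64 audio
  if (ev.type === 'error') console.error('server error:', ev.message);
};

ws.onclose = () => out.end(); // flush the file once the stream ends
```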