voice/azure/architecture-llm.md
The @mastra/voice-azure package is a Mastra integration that provides bidirectional voice capabilities (Text-to-Speech and Speech-to-Text) using Microsoft Azure's Cognitive Services Speech SDK. It extends the base MastraVoice class from @mastra/core to provide a standardized interface for voice interactions within the Mastra ecosystem.
Package Version: 0.11.0-beta.0
Primary Dependency: microsoft-cognitiveservices-speech-sdk (v1.45.0)
AzureVoice (Main Class)
├── extends MastraVoice (from @mastra/core)
├── Dependencies
│ ├── microsoft-cognitiveservices-speech-sdk
│ └── Node.js stream API
├── Configuration
│ ├── speechModel (TTS configuration)
│ └── listeningModel (STT configuration)
└── Static Data
└── AZURE_VOICES (200+ voice definitions)
/src
├── index.ts # Main AzureVoice class implementation
├── voices.ts # Static voice definitions (~200+ voices)
└── index.test.ts # Integration test suite
AzureVoice extends MastraVoice from @mastra/core/voice, which provides the abstract interface that all Mastra voice providers must implement.
The class maintains four private properties:

- `speechConfig?: Azure.SpeechConfig` - configuration for TTS operations
- `listeningConfig?: Azure.SpeechConfig` - configuration for STT operations
- `speechSynthesizer?: Azure.SpeechSynthesizer` - TTS synthesizer instance
- `speechRecognizer?: Azure.SpeechRecognizer` - STT recognizer instance

The constructor accepts an optional configuration object with three properties:
{
speechModel?: {
apiKey?: string // Azure Speech Services API key
region?: string // Azure region (e.g., 'eastus')
voiceName?: string // Default voice (e.g., 'en-US-AriaNeural')
language?: string // Not used for speech synthesis
},
listeningModel?: {
apiKey?: string // Azure Speech Services API key
region?: string // Azure region
language?: string // Recognition language (e.g., 'en-US')
voiceName?: string // Not used for speech recognition
},
speaker?: VoiceId // Default speaker voice ID
}
Configuration notes:

- `apiKey` and `region` fall back to the `AZURE_API_KEY` and `AZURE_REGION` environment variables
- The voice defaults to `'en-US-AriaNeural'` if none is specified
- Azure SDK objects (`SpeechConfig`, `SpeechSynthesizer`, `SpeechRecognizer`) are created during construction

getSpeakers()

Purpose: Returns a list of available voice speakers
Returns: Promise<Array<{ voiceId: string; language: string; region: string; }>>
Implementation:
- Maps over the static `AZURE_VOICES` array
- Parses each voice ID (format `{lang}-{region}-{name}Neural`) into its language and region components

Note: This is a synchronous operation wrapped in a promise for API consistency.
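A hypothetical sketch of this method, parsing IDs of the form `{lang}-{region}-{name}Neural` (not the package source; the voice list here is a small illustrative subset):

```typescript
// Illustrative subset of the real 200+ entry AZURE_VOICES array.
const AZURE_VOICES = ["en-US-AriaNeural", "fr-FR-DeniseNeural"] as const;

async function getSpeakers(): Promise<
  Array<{ voiceId: string; language: string; region: string }>
> {
  return AZURE_VOICES.map((voiceId) => {
    // IDs follow {lang}-{region}-{name}Neural, e.g. "en-US-AriaNeural"
    const parts = voiceId.split("-");
    return { voiceId, language: parts[0] ?? "", region: parts[1] ?? "" };
  });
}
```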
speak()

Purpose: Converts text to speech (TTS)

Parameters:

- `input: string | NodeJS.ReadableStream` - text to synthesize
- `options?: { speaker?: string; [key: string]: any }` - optional parameters

Returns: `Promise<NodeJS.ReadableStream>` - audio stream in WAV format
Data Flow:
Input Text/Stream
↓
[Stream Conversion] (if input is stream)
↓
[Text Validation] (check if empty)
↓
[Voice Configuration] (apply speaker option if provided)
↓
[Azure SDK Synthesis] (speakTextAsync)
↓
[Result Validation] (check ResultReason)
↓
[Buffer Wrapping] (convert audioData to Readable stream)
↓
Output Audio Stream
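The final buffer-wrapping step of this pipeline can be sketched as follows (a simplified stand-in, assuming the Azure SDK result exposes its audio as an `ArrayBuffer`):

```typescript
import { Readable } from "stream";

// Sketch of the last speak() step: wrap the synthesized audio buffer
// in a Node.js Readable stream for the caller to consume.
function wrapAudioData(audioData: ArrayBuffer): Readable {
  return new Readable({
    read() {
      this.push(Buffer.from(audioData)); // emit the whole buffer at once
      this.push(null); // signal end of stream
    },
  });
}
```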
Error Handling:

- Throws an error if `speechConfig` is not initialized
- Uses `Promise.race` around the synthesis call

Technical Details:

- Uses the Azure SDK's `speakTextAsync` API

listen()

Purpose: Transcribes audio to text (STT)
Parameters:

- `audioStream: NodeJS.ReadableStream` - audio input in WAV format

Returns: `Promise<string>` - recognized text
Data Flow:
Audio Stream Input
↓
[Buffer Accumulation] (read all chunks into memory)
↓
[Push Stream Creation] (Azure AudioInputStream)
↓
[Audio Config Setup] (fromStreamInput)
↓
[Recognizer Creation] (new SpeechRecognizer)
↓
[Chunk Writing] (write audio in 4096-byte chunks)
↓
[Recognition Execution] (recognizeOnceAsync)
↓
[Result Validation] (check ResultReason.RecognizedSpeech)
↓
Output Text
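The buffer-accumulation and chunking steps above can be sketched as follows (a simplified stand-in; in the real flow the chunks are written into an Azure push stream rather than collected in an array):

```typescript
import { Readable } from "stream";

// Accumulate the entire incoming audio stream into one Buffer in memory.
async function accumulate(stream: Readable): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Split the accumulated buffer into 4096-byte chunks for writing.
function toChunks(buffer: Buffer, size = 4096): Buffer[] {
  const out: Buffer[] = [];
  for (let offset = 0; offset < buffer.length; offset += size) {
    out.push(buffer.subarray(offset, offset + size));
  }
  return out;
}
```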
Error Handling:

- Throws an error if `listeningConfig` is not initialized

Technical Details:

- Uses the Azure SDK's `recognizeOnceAsync` API (single-utterance recognition)

getListener()

Purpose: Checks if listening capabilities are enabled
Returns: Promise<{ enabled: boolean }>
Implementation: Always returns { enabled: true }
Note: This is a simple capability check method, likely used by the Mastra framework to determine if STT is available
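Per the description above, the method is effectively a constant capability flag:

```typescript
// Sketch of getListener(): a fixed capability check with no I/O.
async function getListener(): Promise<{ enabled: boolean }> {
  return { enabled: true };
}
```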
Located in voices.ts, this file contains a const array of 200+ voice IDs, including:

- Standard neural voices
- Multilingual voices (suffix `MultilingualNeural`)
- Dragon HD voices (suffix `:DragonHDLatestNeural`)
- AI-generated voices (`AIGenerate1Neural`, `AIGenerate2Neural`)
- Turbo multilingual voices (e.g., `AlloyTurboMultilingualNeural`)

Standard format: `{language}-{region}-{name}Neural`

Examples:

- `en-US-AriaNeural`
- `de-DE-SeraphinaMultilingualNeural`
- `en-US-Andrew:DragonHDLatestNeural`

Exported as a TypeScript const assertion type:
export type VoiceId = (typeof AZURE_VOICES)[number];
This provides strict type safety for voice selection.
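For illustration (with a truncated voice list), the const assertion makes the indexed access type a string-literal union, so invalid voice IDs are rejected at compile time:

```typescript
// Illustrative subset of the real 200+ entry array.
const AZURE_VOICES = ["en-US-AriaNeural", "fr-FR-DeniseNeural"] as const;

// Indexed access over the const-asserted tuple yields a literal union.
type VoiceId = (typeof AZURE_VOICES)[number]; // "en-US-AriaNeural" | "fr-FR-DeniseNeural"

const ok: VoiceId = "en-US-AriaNeural"; // compiles
// const bad: VoiceId = "not-a-voice"; // type error: not assignable to VoiceId
```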
The AzureVoice class fulfills the MastraVoice abstract interface:
- Calls `super()` with standardized configuration

The constructor passes configuration to the base MastraVoice class:

- `speechModel.name` and `speechModel.apiKey`
- `listeningModel.name` and `listeningModel.apiKey`
- `speaker` (default voice)

This allows the Mastra framework to track which models are configured.
Dependencies:

- `microsoft-cognitiveservices-speech-sdk` (v1.45.0)
- Node.js stream API (`Readable` from `'stream'`)
- `MastraVoice` class from `@mastra/core`

Build (configured via `tsup.config.ts`):

- ESM (`.js`) and CommonJS (`.cjs`) output
- `.d.ts` type definitions

Package exports:

{
".": {
"import": "./dist/index.js",
"require": "./dist/index.cjs"
}
}
Supports both ESM and CommonJS consumers.
The test suite (index.test.ts) covers:

- Initialization Tests
- getSpeakers() Tests
- speak() Tests
- listen() Tests
- Error Handling Tests
- Credential Management
- Input Validation
- Error Messages
Potential future enhancements:

- Streaming Improvements
- SSML Support
- Configuration Caching
- Advanced Features
- Observability
// Text-to-speech only
const voice = new AzureVoice({
  speechModel: { apiKey: 'key', region: 'eastus' },
});
const audioStream = await voice.speak('Hello World');

// Speech-to-text only
const voice = new AzureVoice({
  listeningModel: { apiKey: 'key', region: 'eastus' },
});
const text = await voice.listen(audioStream);

// Round trip: synthesize, then transcribe
const voice = new AzureVoice({
  speechModel: { apiKey: 'key', region: 'eastus' },
  listeningModel: { apiKey: 'key', region: 'eastus' },
  speaker: 'en-US-JennyNeural',
});
const audio = await voice.speak('Test message');
const transcription = await voice.listen(audio);

// Per-call speaker override
const audio = await voice.speak('Bonjour', {
  speaker: 'fr-FR-DeniseNeural',
});
The @mastra/voice-azure package provides a clean, TypeScript-native wrapper around Azure's Cognitive Services Speech SDK. It implements the Mastra voice provider interface with proper error handling, type safety, and resource management. The architecture prioritizes simplicity and correctness over advanced features, making it suitable for basic voice synthesis and recognition tasks within the Mastra ecosystem.
Key architectural strengths:
Key areas for improvement: