Architecture Documentation: @mastra/voice-azure

Overview

The @mastra/voice-azure package is a Mastra integration that provides bidirectional voice capabilities (Text-to-Speech and Speech-to-Text) using Microsoft Azure's Cognitive Services Speech SDK. It extends the base MastraVoice class from @mastra/core to provide a standardized interface for voice interactions within the Mastra ecosystem.

Package Version: 0.11.0-beta.0
Primary Dependency: microsoft-cognitiveservices-speech-sdk (v1.45.0)

Core Architecture

High-Level Component Structure

AzureVoice (Main Class)
├── extends MastraVoice (from @mastra/core)
├── Dependencies
│   ├── microsoft-cognitiveservices-speech-sdk
│   └── Node.js stream API
├── Configuration
│   ├── speechModel (TTS configuration)
│   └── listeningModel (STT configuration)
└── Static Data
    └── AZURE_VOICES (200+ voice definitions)

File Organization

/src
  ├── index.ts          # Main AzureVoice class implementation
  ├── voices.ts         # Static voice definitions (200+ voices)
  └── index.test.ts     # Integration test suite

Class: AzureVoice

Inheritance

AzureVoice extends MastraVoice from @mastra/core/voice, which provides the abstract interface that all Mastra voice providers must implement.

Private State

The class maintains four private properties (see the sketch after this list):

  1. speechConfig?: Azure.SpeechConfig - Configuration for TTS operations
  2. listeningConfig?: Azure.SpeechConfig - Configuration for STT operations
  3. speechSynthesizer?: Azure.SpeechSynthesizer - TTS synthesizer instance
  4. speechRecognizer?: Azure.SpeechRecognizer - STT recognizer instance
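
A structural sketch of the class implied by the above (method bodies are covered in later sections and elided here, so the skeleton is illustrative rather than compilable):

typescript
import { MastraVoice } from '@mastra/core/voice';
import * as Azure from 'microsoft-cognitiveservices-speech-sdk';

// Structural sketch only; constructor and method bodies are elided.
export class AzureVoice extends MastraVoice {
  private speechConfig?: Azure.SpeechConfig;            // TTS configuration
  private listeningConfig?: Azure.SpeechConfig;         // STT configuration
  private speechSynthesizer?: Azure.SpeechSynthesizer;  // TTS synthesizer instance
  private speechRecognizer?: Azure.SpeechRecognizer;    // STT recognizer instance
  // constructor, speak(), listen(), getSpeakers(), getListener(): see below
}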

Constructor Configuration

The constructor accepts an optional configuration object with three properties:

typescript
{
  speechModel?: {
    apiKey?: string      // Azure Speech Services API key
    region?: string      // Azure region (e.g., 'eastus')
    voiceName?: string   // Default voice (e.g., 'en-US-AriaNeural')
    language?: string    // Not used for speech synthesis
  },
  listeningModel?: {
    apiKey?: string      // Azure Speech Services API key
    region?: string      // Azure region
    language?: string    // Recognition language (e.g., 'en-US')
    voiceName?: string   // Not used for speech recognition
  },
  speaker?: VoiceId      // Default speaker voice ID
}

Configuration Initialization Flow

  1. Environment Variable Fallback: Both apiKey and region fall back to AZURE_API_KEY and AZURE_REGION environment variables
  2. Validation: Constructor throws errors if required credentials are missing
  3. Dual Configuration: Speech synthesis and recognition are configured independently
  4. Default Voice: Speech synthesis defaults to 'en-US-AriaNeural' if no voice is specified
  5. SDK Initialization: Creates Azure SDK instances (SpeechConfig, SpeechSynthesizer, SpeechRecognizer) during construction
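
A condensed sketch of this flow for the speech-synthesis half, assuming the Azure SDK's SpeechConfig.fromSubscription factory (the actual source may differ in detail; the STT half is analogous):

typescript
import * as Azure from 'microsoft-cognitiveservices-speech-sdk';

function initSpeechConfig(opts: { apiKey?: string; region?: string; voiceName?: string } = {}) {
  // 1. Environment variable fallback
  const apiKey = opts.apiKey ?? process.env.AZURE_API_KEY;
  const region = opts.region ?? process.env.AZURE_REGION;
  // 2. Validation: fail fast on missing credentials
  if (!apiKey || !region) {
    throw new Error('Azure API key and region are required');
  }
  // 3-5. Independent SDK config with a default voice
  const speechConfig = Azure.SpeechConfig.fromSubscription(apiKey, region);
  speechConfig.speechSynthesisVoiceName = opts.voiceName ?? 'en-US-AriaNeural';
  return speechConfig;
}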

Public Methods

1. getSpeakers()

Purpose: Returns a list of available voice speakers

Returns: Promise<Array<{ voiceId: string; language: string; region: string; }>>

Implementation:

  • Maps over the static AZURE_VOICES array
  • Parses voice IDs to extract language and region (format: {language}-{region}-{name}Neural)
  • Returns metadata for all 200+ available voices

Note: This is a synchronous operation wrapped in a promise for API consistency
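
An illustrative re-implementation based on the description above (the package's actual parsing may differ in detail):

typescript
import { AZURE_VOICES } from './voices';

// Map each static voice ID to { voiceId, language, region } metadata.
async function getSpeakers() {
  return AZURE_VOICES.map((voiceId) => {
    // IDs follow {language}-{region}-{name}Neural, e.g. 'en-US-AriaNeural'
    const [language, region] = voiceId.split('-');
    return { voiceId, language, region };
  });
}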

2. speak(input, options?)

Purpose: Converts text to speech (TTS)

Parameters:

  • input: string | NodeJS.ReadableStream - Text to synthesize
  • options?: { speaker?: string; [key: string]: any } - Optional parameters

Returns: Promise<NodeJS.ReadableStream> - Audio stream in WAV format

Data Flow:

Input Text/Stream
    ↓
[Stream Conversion] (if input is stream)
    ↓
[Text Validation] (check if empty)
    ↓
[Voice Configuration] (apply speaker option if provided)
    ↓
[Azure SDK Synthesis] (speakTextAsync)
    ↓
[Result Validation] (check ResultReason)
    ↓
[Buffer Wrapping] (convert audioData to Readable stream)
    ↓
Output Audio Stream

Error Handling:

  • Throws if speechConfig is not initialized
  • Handles stream reading errors
  • Validates input text is not empty
  • Implements 5-second timeout using Promise.race
  • Properly closes synthesizer in finally block
  • Validates synthesis result reason

Technical Details:

  • Creates new synthesizer instance per request
  • Uses Azure's speakTextAsync API
  • Returns audio data as a Node.js Readable stream containing a single Buffer
  • Audio format is determined by Azure SDK defaults (typically 16kHz, 16-bit, mono PCM WAV)
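
A sketch of the synthesis core combining the flow and safeguards above (an initialized speechConfig is assumed; error messages are illustrative):

typescript
import { Readable } from 'stream';
import * as Azure from 'microsoft-cognitiveservices-speech-sdk';

// Per-request synthesizer, 5-second timeout via Promise.race,
// result validation, and cleanup in finally.
async function synthesize(speechConfig: Azure.SpeechConfig, text: string): Promise<Readable> {
  const synthesizer = new Azure.SpeechSynthesizer(speechConfig);
  try {
    const result = await Promise.race([
      new Promise<Azure.SpeechSynthesisResult>((resolve, reject) =>
        synthesizer.speakTextAsync(text, resolve, reject),
      ),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('Speech synthesis timed out')), 5000),
      ),
    ]);
    if (result.reason !== Azure.ResultReason.SynthesizingAudioCompleted) {
      throw new Error(`Speech synthesis failed: ${Azure.ResultReason[result.reason]}`);
    }
    // Wrap the audio buffer in a single-chunk Readable stream
    return Readable.from([Buffer.from(result.audioData)]);
  } finally {
    synthesizer.close();
  }
}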

3. listen(audioStream)

Purpose: Transcribes audio to text (STT)

Parameters:

  • audioStream: NodeJS.ReadableStream - Audio input in WAV format

Returns: Promise<string> - Recognized text

Data Flow:

Audio Stream Input
    ↓
[Buffer Accumulation] (read all chunks into memory)
    ↓
[Push Stream Creation] (Azure AudioInputStream)
    ↓
[Audio Config Setup] (fromStreamInput)
    ↓
[Recognizer Creation] (new SpeechRecognizer)
    ↓
[Chunk Writing] (write audio in 4096-byte chunks)
    ↓
[Recognition Execution] (recognizeOnceAsync)
    ↓
[Result Validation] (check ResultReason.RecognizedSpeech)
    ↓
Output Text

Error Handling:

  • Throws if listeningConfig is not initialized
  • Validates recognition result reason
  • Provides detailed error messages with reason codes
  • Properly closes recognizer in finally block

Technical Details:

  • Accumulates entire audio stream into memory before processing
  • Writes audio data in 4096-byte chunks to Azure's push stream
  • Uses Azure's recognizeOnceAsync API (single utterance recognition)
  • Audio must be in WAV format compatible with Azure's requirements
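
A sketch of the recognition core under the same caveats (an initialized listeningConfig and a fully buffered WAV payload are assumed):

typescript
import * as Azure from 'microsoft-cognitiveservices-speech-sdk';

// Push-stream setup, 4096-byte chunk writes, single-utterance recognition,
// and cleanup in finally.
async function recognize(listeningConfig: Azure.SpeechConfig, audio: Buffer): Promise<string> {
  const pushStream = Azure.AudioInputStream.createPushStream();
  for (let offset = 0; offset < audio.length; offset += 4096) {
    const chunk = audio.subarray(offset, offset + 4096);
    // The SDK's push stream expects ArrayBuffer input
    pushStream.write(chunk.buffer.slice(chunk.byteOffset, chunk.byteOffset + chunk.byteLength));
  }
  pushStream.close();

  const audioConfig = Azure.AudioConfig.fromStreamInput(pushStream);
  const recognizer = new Azure.SpeechRecognizer(listeningConfig, audioConfig);
  try {
    const result = await new Promise<Azure.SpeechRecognitionResult>((resolve, reject) =>
      recognizer.recognizeOnceAsync(resolve, reject),
    );
    if (result.reason !== Azure.ResultReason.RecognizedSpeech) {
      throw new Error(`Recognition failed: ${Azure.ResultReason[result.reason]}`);
    }
    return result.text;
  } finally {
    recognizer.close();
  }
}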

4. getListener()

Purpose: Checks if listening capabilities are enabled

Returns: Promise<{ enabled: boolean }>

Implementation: Always returns { enabled: true }

Note: This is a simple capability check method, likely used by the Mastra framework to determine if STT is available

Voice Definitions

AZURE_VOICES Array

Defined in voices.ts, AZURE_VOICES is a const array of 200+ voice IDs covering:

  • Languages: 50+ languages including Arabic, English, German, Spanish, Chinese, etc.
  • Regions: Multiple regional variants (e.g., en-US, en-GB, en-AU)
  • Voice Types:
    • Standard Neural voices
    • Multilingual voices (suffixed with Multilingual)
    • HD voices (suffixed with :DragonHDLatestNeural)
    • AI-generated voices (e.g., AIGenerate1Neural, AIGenerate2Neural)
    • Turbo multilingual voices (e.g., AlloyTurboMultilingualNeural)

Voice ID Format

Standard format: {language}-{region}-{name}Neural

Examples:

  • en-US-AriaNeural
  • de-DE-SeraphinaMultilingualNeural
  • en-US-Andrew:DragonHDLatestNeural

VoiceId Type

Exported as a TypeScript union type derived from the const-asserted AZURE_VOICES array:

typescript
export type VoiceId = (typeof AZURE_VOICES)[number];

This provides strict type safety for voice selection.
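
For example, assuming VoiceId is re-exported from the package entry point:

typescript
import type { VoiceId } from '@mastra/voice-azure';

const speaker: VoiceId = 'en-US-AriaNeural';  // compiles: known voice ID
// const typo: VoiceId = 'en-US-ArriaNeural'; // rejected at compile time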

Integration with Mastra Framework

Base Class Contract

The AzureVoice class fulfills the MastraVoice abstract interface:

  1. Constructor: Calls super() with standardized configuration
  2. speak(): Implements text-to-speech
  3. listen(): Implements speech-to-text
  4. getSpeakers(): Returns available voices
  5. getListener(): Returns listener capability status

Configuration Propagation

The constructor passes configuration to the base MastraVoice class:

  • speechModel.name and speechModel.apiKey
  • listeningModel.name and listeningModel.apiKey
  • speaker (default voice)

This allows the Mastra framework to track which models are configured.

Error Handling Strategy

Configuration Errors

  • Missing API key → throws immediately in constructor
  • Missing region → throws immediately in constructor

Runtime Errors

  • Unconfigured models → throws on method call
  • Empty input text → throws validation error
  • Stream reading errors → wrapped and re-thrown
  • Synthesis/recognition failures → detailed error messages with Azure reason codes
  • Timeout protection → 5-second timeout on synthesis

Resource Cleanup

  • Uses try/catch/finally pattern
  • Closes synthesizer and recognizer instances
  • Prevents resource leaks

Dependencies

External Dependencies

  1. microsoft-cognitiveservices-speech-sdk (v1.45.0)

    • Core Azure Speech Services SDK
    • Provides SpeechConfig, SpeechSynthesizer, SpeechRecognizer
    • Handles audio format conversion and streaming
  2. Node.js stream API

    • Uses Readable from 'stream'
    • Handles streaming input/output

Peer Dependencies

  • @mastra/core (>=1.0.0-0 <2.0.0-0)
    • Provides base MastraVoice class
    • Defines voice provider interface

Build and Distribution

Build Configuration

  • Bundler: tsup (configured in tsup.config.ts)
  • Output formats: ESM (.js) and CommonJS (.cjs)
  • TypeScript: Generates .d.ts type definitions
  • Source maps: Generated for all outputs
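
A minimal tsup.config.ts consistent with these outputs might look like the following (the actual config is not reproduced here):

typescript
import { defineConfig } from 'tsup';

export default defineConfig({
  entry: ['src/index.ts'],
  format: ['esm', 'cjs'], // .js and .cjs outputs
  dts: true,              // emit .d.ts type definitions
  sourcemap: true,        // source maps for all outputs
});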

Package Exports

json
{
  ".": {
    "import": "./dist/index.js",
    "require": "./dist/index.cjs"
  }
}

Supports both ESM and CommonJS consumers.

Testing Strategy

The test suite (index.test.ts) covers:

  1. Initialization Tests

    • Default parameter initialization
    • Environment variable fallback
  2. getSpeakers() Tests

    • Voice list structure validation
    • Metadata completeness
  3. speak() Tests

    • Basic text synthesis
    • Custom voice selection
    • Stream input handling
    • Parameter variations
    • Error cases (empty text, missing API key)
  4. listen() Tests

    • Audio file transcription
    • Stream transcription
    • Round-trip (speak → listen) verification (sketched after this list)
  5. Error Handling Tests

    • Empty input validation
    • Missing credentials
    • Configuration errors
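
A round-trip check along the lines of item 4 might look like this (hypothetical test, assuming a vitest-style runner and live Azure credentials via environment variables):

typescript
import { describe, expect, it } from 'vitest';
import { AzureVoice } from '@mastra/voice-azure';

// Hypothetical round-trip test: synthesize a phrase, transcribe it back,
// and check the text survives. Requires AZURE_API_KEY / AZURE_REGION.
describe('round trip', () => {
  it('recovers the spoken text', async () => {
    const voice = new AzureVoice({}); // credentials from environment
    const audio = await voice.speak('Hello from Azure');
    const text = await voice.listen(audio);
    expect(text.toLowerCase()).toContain('hello');
  });
});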

Test Utilities

  • Creates test output directory for audio files
  • Writes synthesized audio to files for inspection
  • Uses real Azure API (requires credentials)

Performance Considerations

Memory Usage

  • listen() accumulates entire audio stream in memory
  • Large audio files may cause memory pressure
  • Consider streaming alternatives for production use (one is sketched below)
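
One lower-memory alternative, not implemented in the package, is to pipe chunks into Azure's push stream as they arrive rather than buffering the whole file:

typescript
import * as Azure from 'microsoft-cognitiveservices-speech-sdk';

// Forward stream chunks to the push stream incrementally (illustrative).
function pipeToPushStream(audioStream: NodeJS.ReadableStream): Azure.PushAudioInputStream {
  const pushStream = Azure.AudioInputStream.createPushStream();
  audioStream.on('data', (chunk: Buffer) => {
    pushStream.write(chunk.buffer.slice(chunk.byteOffset, chunk.byteOffset + chunk.byteLength));
  });
  audioStream.on('end', () => pushStream.close());
  return pushStream;
}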

Timeout Protection

  • speak() implements 5-second timeout
  • Prevents hanging on Azure API failures
  • May need adjustment for longer texts

Resource Management

  • Creates new synthesizer/recognizer instances per request
  • No instance pooling or reuse
  • Proper cleanup in finally blocks

Security Considerations

  1. Credential Management

    • API keys via environment variables (recommended)
    • Supports direct configuration (use with caution)
    • No credential validation before API calls
  2. Input Validation

    • Text input validated for emptiness
    • No SSML injection protection (a minimal escaping sketch follows this list)
    • Stream inputs trusted
  3. Error Messages

    • May expose internal Azure error details
    • Consider sanitizing errors in production
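
If SSML input were added (see Future Enhancement Opportunities below), untrusted text should be escaped before being embedded in markup. A minimal escaper, not part of the package, might look like:

typescript
// Escape XML-reserved characters in untrusted text before embedding in SSML.
function escapeForSsml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}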

Future Enhancement Opportunities

  1. Streaming Improvements

    • Stream audio chunks incrementally in listen()
    • Reduce memory footprint for large files
  2. SSML Support

    • Add explicit SSML input handling
    • Validate and sanitize SSML
  3. Configuration Caching

    • Reuse synthesizer/recognizer instances
    • Connection pooling
  4. Advanced Features

    • Voice customization parameters
    • Audio format selection
    • Real-time streaming synthesis
    • Continuous recognition mode
  5. Observability

    • Add metrics/telemetry
    • Performance monitoring
    • Usage tracking

Usage Examples

Basic TTS

typescript
import { AzureVoice } from '@mastra/voice-azure';

const voice = new AzureVoice({
  speechModel: { apiKey: 'key', region: 'eastus' },
});
const audioStream = await voice.speak('Hello World');

Basic STT

typescript
const voice = new AzureVoice({
  listeningModel: { apiKey: 'key', region: 'eastus' },
});
const text = await voice.listen(audioStream);

Full Bidirectional

typescript
const voice = new AzureVoice({
  speechModel: { apiKey: 'key', region: 'eastus' },
  listeningModel: { apiKey: 'key', region: 'eastus' },
  speaker: 'en-US-JennyNeural',
});

const audio = await voice.speak('Test message');
const transcription = await voice.listen(audio);

Custom Voice Selection

typescript
const audio = await voice.speak('Bonjour', {
  speaker: 'fr-FR-DeniseNeural',
});

Summary

The @mastra/voice-azure package provides a clean, TypeScript-native wrapper around Azure's Cognitive Services Speech SDK. It implements the Mastra voice provider interface with proper error handling, type safety, and resource management. The architecture prioritizes simplicity and correctness over advanced features, making it suitable for basic voice synthesis and recognition tasks within the Mastra ecosystem.

Key architectural strengths:

  • Clean separation of TTS and STT configuration
  • Type-safe voice selection with 200+ options
  • Proper resource cleanup
  • Error handling with detailed messages
  • Framework integration through base class inheritance

Key areas for improvement:

  • Memory efficiency in STT processing
  • Advanced streaming capabilities
  • Configuration caching and reuse
  • Enhanced observability