examples/provider-elevenlabs/stt/README.md
This example demonstrates how to use the ElevenLabs STT provider to test audio transcription.
npx promptfoo@latest init --example provider-elevenlabs/stt
cd provider-elevenlabs/stt
export ELEVENLABS_API_KEY=your_api_key_here
npx promptfoo@latest eval
Set your API key:
export ELEVENLABS_API_KEY=your_api_key_here
Prepare audio files:
Create an audio/ directory with your test audio files:
mkdir -p audio
# Place your audio files in the audio/ directory
Run the evaluation:
promptfoo eval
providers:
  - id: elevenlabs:stt:basic
    config:
      modelId: eleven_speech_to_text_v1
      language: en # ISO 639-1 language code
Identify and label different speakers in your audio:
providers:
  - id: elevenlabs:stt:diarization
    config:
      modelId: eleven_speech_to_text_v1
      diarization: true
      maxSpeakers: 3 # Optional: hint for expected number of speakers
The response will include speaker segments:
{
  "text": "Full transcription...",
  "diarization": [
    {
      "speaker_id": "speaker_0",
      "text": "Hello, how are you?",
      "start_time_ms": 0,
      "end_time_ms": 2500,
      "confidence": 0.95
    },
    {
      "speaker_id": "speaker_1",
      "text": "I'm doing well, thanks!",
      "start_time_ms": 2500,
      "end_time_ms": 5000,
      "confidence": 0.92
    }
  ]
}
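As a rough illustration of how the speaker segments might be consumed, the sketch below totals speaking time per speaker. It assumes the response has exactly the shape shown above; `summarizeSpeakers` is a hypothetical helper, not part of promptfoo or the ElevenLabs API.

```javascript
// Sample data copied from the diarization response shown above.
const response = {
  text: 'Full transcription...',
  diarization: [
    { speaker_id: 'speaker_0', text: 'Hello, how are you?', start_time_ms: 0, end_time_ms: 2500, confidence: 0.95 },
    { speaker_id: 'speaker_1', text: "I'm doing well, thanks!", start_time_ms: 2500, end_time_ms: 5000, confidence: 0.92 },
  ],
};

// Hypothetical helper: count segments and total speaking time (ms) per speaker.
function summarizeSpeakers(segments) {
  const summary = {};
  for (const seg of segments || []) {
    if (!summary[seg.speaker_id]) {
      summary[seg.speaker_id] = { segments: 0, totalMs: 0 };
    }
    summary[seg.speaker_id].segments += 1;
    summary[seg.speaker_id].totalMs += seg.end_time_ms - seg.start_time_ms;
  }
  return summary;
}

console.log(summarizeSpeakers(response.diarization));
```

A summary like this can be handy when an assertion needs more than a speaker count, for example checking that no single speaker dominates the recording.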
Word Error Rate (WER) measures transcription accuracy. Lower is better (0 = perfect).
providers:
  - id: elevenlabs:stt:accuracy
    config:
      modelId: eleven_speech_to_text_v1
      calculateWER: true
      referenceText: The quick brown fox jumps over the lazy dog
WER Formula: (Substitutions + Deletions + Insertions) / Total Words
The response includes detailed WER metrics:
{
  "wer": 0.05, // 5% error rate
  "substitutions": 1,
  "deletions": 0,
  "insertions": 0,
  "correct": 19,
  "totalWords": 20,
  "details": {
    "reference": "the quick brown fox jumps",
    "hypothesis": "the quick green fox jumps",
    "alignment": "REF: the quick brown fox jumps\nHYP: the quick green fox jumps\nOPS: SSSSS"
  }
}
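To make the WER formula concrete, here is a minimal sketch that computes it with a word-level edit distance. This is an illustration only and not necessarily how the provider calculates WER; `computeWer` is a hypothetical helper, and the tokenization (lowercase, split on whitespace) is an assumption.

```javascript
// Hypothetical helper: WER = (substitutions + deletions + insertions) / reference word count.
function computeWer(reference, hypothesis) {
  const tokenize = (s) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);

  // dist[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words.
  const dist = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const subCost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dist[i][j] = Math.min(
        dist[i - 1][j - 1] + subCost, // match or substitution
        dist[i - 1][j] + 1, // deletion (reference word missing from hypothesis)
        dist[i][j - 1] + 1, // insertion (extra word in hypothesis)
      );
    }
  }

  // Walk back through the table to count each operation type.
  const counts = { substitutions: 0, deletions: 0, insertions: 0, correct: 0 };
  let i = ref.length;
  let j = hyp.length;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && ref[i - 1] === hyp[j - 1] && dist[i][j] === dist[i - 1][j - 1]) {
      counts.correct++; i--; j--;
    } else if (i > 0 && j > 0 && dist[i][j] === dist[i - 1][j - 1] + 1) {
      counts.substitutions++; i--; j--;
    } else if (i > 0 && dist[i][j] === dist[i - 1][j] + 1) {
      counts.deletions++; i--;
    } else {
      counts.insertions++; j--;
    }
  }

  const errors = counts.substitutions + counts.deletions + counts.insertions;
  // With an empty reference, treat any hypothesis words as a full error.
  const wer = ref.length === 0 ? (hyp.length > 0 ? 1 : 0) : errors / ref.length;
  return { wer, ...counts, totalWords: ref.length };
}

console.log(computeWer('the quick brown fox jumps', 'the quick green fox jumps'));
```

For the five-word pair above, one substitution out of five reference words gives a WER of 0.2. The sample response reports 0.05 because its counts describe a 20-word reference.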
WER Interpretation:
| Format | Extension | Notes |
|---|---|---|
| MP3 | .mp3 | Widely compatible |
| MP4 Audio | .mp4, .m4a | AAC/MPEG-4 audio |
| WAV | .wav | Uncompressed, high quality |
| FLAC | .flac | Lossless compression |
| OGG | .ogg | Open format |
| Opus | .opus | Modern, efficient codec |
| WebM | .webm | Web-optimized |
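If you want to screen files before running an evaluation, a small pre-flight check along these lines can help. This is a sketch: the extension list is copied from the table above, and `isSupportedAudioFile` is a hypothetical helper rather than anything promptfoo exposes.

```javascript
// Extensions copied from the supported formats table above.
const SUPPORTED_EXTENSIONS = new Set([
  '.mp3', '.mp4', '.m4a', '.wav', '.flac', '.ogg', '.opus', '.webm',
]);

// Return the lowercased extension (including the dot), or '' if there is none.
function getExtension(filePath) {
  const base = filePath.split(/[\\/]/).pop();
  const dot = base.lastIndexOf('.');
  return dot > 0 ? base.slice(dot).toLowerCase() : '';
}

// Hypothetical helper: true when the file extension appears in the table.
function isSupportedAudioFile(filePath) {
  return SUPPORTED_EXTENSIONS.has(getExtension(filePath));
}

console.log(isSupportedAudioFile('audio/sample1.mp3')); // true
console.log(isSupportedAudioFile('audio/clip.avi')); // false
```

The check looks only at the file name, so a mislabeled file would still pass; it is meant to catch obvious mistakes early, not to validate the audio itself.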
providers:
  - id: elevenlabs:stt
    config:
      audioFile: path/to/audio.mp3

prompts:
  - audio/sample1.mp3
  - audio/sample2.wav

tests:
  - vars:
      audioFile: audio/sample.mp3
tests:
  - assert:
      - type: cost
        threshold: 0.05 # Max $0.05 per transcription

tests:
  - assert:
      - type: latency
        threshold: 10000 # Max 10 seconds

tests:
  - assert:
      - type: contains
        value: expected phrase
      - type: not-contains
        value: incorrect phrase
tests:
  - assert:
      - type: javascript
        value: |
          // Use ?? so a perfect WER of 0 is not replaced by the fallback
          const wer = context.vars.metadata?.wer?.wer ?? 1;
          return wer < 0.1; // Less than 10% error
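One subtlety when writing WER assertions: a perfect transcription has a WER of exactly 0, which JavaScript treats as falsy. The sketch below contrasts a `||` fallback with a `??` fallback, using a hypothetical `metadata` object shaped like the WER response above.

```javascript
// Hypothetical metadata for a perfect transcription (WER of exactly 0).
const metadata = { wer: { wer: 0 } };

const withOr = metadata?.wer?.wer || 1; // 0 is falsy, so the fallback 1 is used
const withNullish = metadata?.wer?.wer ?? 1; // 0 is kept; only null/undefined fall back

console.log(withOr < 0.1); // false: a perfect score would fail the check
console.log(withNullish < 0.1); // true

// When the metric is missing entirely, ?? still falls back to 1 and the check fails.
const missing = {};
console.log((missing?.wer?.wer ?? 1) < 0.1); // false
```

Falling back to 1 when the metric is absent keeps the assertion conservative: a missing WER is treated as a failure rather than silently passing.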
tests:
  - assert:
      - type: javascript
        value: |
          const diarization = context.vars.metadata?.transcription?.diarization || [];
          const uniqueSpeakers = new Set(diarization.map((s) => s.speaker_id));
          return uniqueSpeakers.size === 2; // Expect 2 speakers
ElevenLabs STT supports 30+ languages. Specify the language with an ISO 639-1 code:
config:
  language: en # English
  # language: es # Spanish
  # language: fr # French
  # language: de # German
  # language: it # Italian
  # language: pt # Portuguese
  # language: ja # Japanese
  # language: ko # Korean
  # language: zh # Chinese
Auto-detection: Omit language to let the API detect the language automatically.
STT pricing is based on audio duration:
The provider automatically tracks and reports costs in the evaluation results.
prompts:
  - audio/batch1.mp3
  - audio/batch2.mp3
  - audio/batch3.mp3

providers:
  - id: elevenlabs:stt
    config:
      modelId: eleven_speech_to_text_v1

# Test all files with consistent assertions
tests:
  - assert:
      - type: cost
        threshold: 0.10
      - type: latency
        threshold: 15000
providers:
  - id: elevenlabs:stt:english
    config:
      language: en
  - id: elevenlabs:stt:spanish
    config:
      language: es
  - id: elevenlabs:stt:autodetect
    config:
      # No language specified = auto-detect

prompts:
  - audio/english_sample.mp3
  - audio/spanish_sample.mp3
Compare transcription accuracy across different audio qualities:
prompts:
  - audio/high_quality_48khz.wav
  - audio/medium_quality_16khz.mp3
  - audio/low_quality_8khz.mp3

providers:
  - id: elevenlabs:stt
    config:
      calculateWER: true
      referenceText: This is the expected transcription text

tests:
  - description: High quality should have WER < 5%
    vars:
      audioFile: audio/high_quality_48khz.wav
    assert:
      - type: javascript
        value: (context.vars.metadata?.wer?.wer ?? 1) < 0.05
  - description: Medium quality should have WER < 10%
    vars:
      audioFile: audio/medium_quality_16khz.mp3
    assert:
      - type: javascript
        value: (context.vars.metadata?.wer?.wer ?? 1) < 0.10
# Verify your API key is set
echo $ELEVENLABS_API_KEY
# Or set it inline
ELEVENLABS_API_KEY=your_key promptfoo eval
Error: Failed to read audio file: ENOENT: no such file or directory
Solution: Use absolute paths or paths relative to the config file:
prompts:
  - /absolute/path/to/audio.mp3
  - ./relative/path/to/audio.mp3
Error: Unsupported audio format
Solution: Convert your audio to a supported format (MP3, WAV, etc.) using tools like ffmpeg:
ffmpeg -i input.video -vn -acodec mp3 output.mp3
If you're getting unexpectedly high WER:
| Option | Type | Default | Description |
|---|---|---|---|
| modelId | string | eleven_speech_to_text_v1 | STT model to use |
| language | string | auto-detect | ISO 639-1 language code |
| diarization | boolean | false | Enable speaker identification |
| maxSpeakers | number | - | Expected number of speakers |
| audioFile | string | - | Path to audio file |
| audioFormat | string | auto-detect | Audio format override |
| referenceText | string | - | Expected transcription for WER |
| calculateWER | boolean | false | Calculate Word Error Rate |
| baseUrl | string | https://api.elevenlabs.io/v1 | API endpoint |
| timeout | number | 120000 | Request timeout (ms) |
| retries | number | 3 | Number of retry attempts |
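To show how the defaults in the table combine with values you set, here is a minimal sketch of merging a user config over the defaults. The default values are copied from the table; `resolveConfig` is a hypothetical helper and does not describe the provider's internal implementation.

```javascript
// Defaults copied from the table above. Options with no default
// (maxSpeakers, audioFile, referenceText) or with auto-detection
// (language, audioFormat) are left unset.
const DEFAULTS = {
  modelId: 'eleven_speech_to_text_v1',
  diarization: false,
  calculateWER: false,
  baseUrl: 'https://api.elevenlabs.io/v1',
  timeout: 120000,
  retries: 3,
};

// Hypothetical helper: user-supplied values override the defaults.
function resolveConfig(userConfig = {}) {
  return { ...DEFAULTS, ...userConfig };
}

console.log(resolveConfig({ diarization: true, maxSpeakers: 3 }));
```

In this sketch, anything you omit keeps its default, so a config that sets only `diarization` and `maxSpeakers` still gets the 120000 ms timeout and 3 retries.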