provider-elevenlabs/alignment (ElevenLabs Forced Alignment)

examples/provider-elevenlabs/alignment/README.md

Generate time-aligned subtitles (SRT/VTT) from audio and transcripts using ElevenLabs forced alignment.

Quick Start

bash
npx promptfoo@latest init --example provider-elevenlabs/alignment
cd provider-elevenlabs/alignment
export ELEVENLABS_API_KEY=your_api_key_here
npx promptfoo@latest eval

What this tests

  • Subtitle generation: Create SRT and VTT subtitle files
  • Word-level alignment: Precise timestamp data for each word
  • Multiple formats: JSON (raw data), SRT (video players), VTT (web players)
  • Accuracy: Verify alignment matches audio timing

How it works

Forced alignment takes two inputs:

  1. Audio file: Speech recording (MP3, WAV, etc.)
  2. Transcript: Text of what was spoken

It returns precise timestamps showing when each word was spoken, either as raw JSON or formatted as SRT or VTT subtitles.
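
As a rough illustration of how character-level timestamps relate to word-level ones, the sketch below groups per-character entries into words. It assumes the `{ char, start, end }` shape shown in this README's JSON example; `charsToWords` is a hypothetical helper, not part of promptfoo or the ElevenLabs SDK, and real responses may use different field names.

```javascript
// Assumption: `alignment` is an array of { char, start, end } entries
// (times in seconds), as in this README's JSON example.
function charsToWords(alignment) {
  const words = [];
  let current = null;
  for (const { char, start, end } of alignment) {
    if (/\s/.test(char)) {
      // Whitespace closes the word in progress, if there is one.
      if (current) {
        words.push(current);
        current = null;
      }
      continue;
    }
    if (!current) {
      // First character of a new word sets its start time.
      current = { text: char, start, end };
    } else {
      // Later characters extend the text and push the end time forward.
      current.text += char;
      current.end = end;
    }
  }
  if (current) words.push(current);
  return words;
}

const sampleAlignment = [
  { char: "H", start: 0.0, end: 0.1 },
  { char: "i", start: 0.1, end: 0.2 },
  { char: " ", start: 0.2, end: 0.25 },
  { char: "y", start: 0.25, end: 0.3 },
  { char: "o", start: 0.3, end: 0.4 },
  { char: "u", start: 0.4, end: 0.5 },
];
console.log(charsToWords(sampleAlignment));
```

Punctuation stays attached to the preceding word here, which is usually what subtitle text wants.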

Use Cases

  • Video subtitles: Generate SRT files for video editing software
  • Web captions: Create VTT files for HTML5 video players
  • Karaoke apps: Word-level timing for synchronized highlighting
  • Accessibility: Auto-generate captions for spoken content
  • Translation sync: Time-align translations to original audio

Output Formats

JSON (Raw alignment data)

json
{
  "alignment": [
    { "char": "T", "start": 0.0, "end": 0.1 },
    { "char": "h", "start": 0.1, "end": 0.15 }
  ],
  "characters": "That's one small step..."
}

SRT (Standard video subtitles)

text
1
00:00:00,000 --> 00:00:02,500
That's one small step for man

2
00:00:02,500 --> 00:00:05,000
one giant leap for mankind
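
If you start from JSON timings and want SRT yourself, the layout above can be produced with a few lines. This is a minimal sketch, assuming cues are already chunked into `{ start, end, text }` objects with times in seconds; `formatTimestamp` and `toSrt` are illustrative names, not provider functions.

```javascript
// Format seconds as HH:MM:SS<sep>mmm. SRT uses a comma before the milliseconds.
function formatTimestamp(seconds, separator = ",") {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${separator}${pad(ms, 3)}`;
}

// Build an SRT document: numbered cues separated by blank lines.
function toSrt(cues) {
  const blocks = cues.map(
    (cue, i) =>
      `${i + 1}\n${formatTimestamp(cue.start)} --> ${formatTimestamp(cue.end)}\n${cue.text}`
  );
  return blocks.join("\n\n") + "\n";
}

const exampleCues = [
  { start: 0, end: 2.5, text: "That's one small step for man" },
  { start: 2.5, end: 5, text: "one giant leap for mankind" },
];
console.log(toSrt(exampleCues));
```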

VTT (WebVTT for web players)

text
WEBVTT

1
00:00:00.000 --> 00:00:02.500
That's one small step for man

2
00:00:02.500 --> 00:00:05.000
one giant leap for mankind
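
The two subtitle formats differ mainly in the `WEBVTT` header and the millisecond separator (comma in SRT, period in VTT). The sketch below converts one to the other; `srtToVtt` is a hypothetical helper for illustration and handles only plain cues like the ones above, not styling or positioning.

```javascript
// Convert simple SRT text to WebVTT.
function srtToVtt(srt) {
  const lines = srt.replace(/\r\n/g, "\n").trim().split("\n");
  const converted = lines.map((line) =>
    // Only touch timing lines, so commas in the spoken text are preserved.
    line.includes("-->")
      ? line.replace(/(\d{2}:\d{2}:\d{2}),(\d{3})/g, "$1.$2")
      : line
  );
  return `WEBVTT\n\n${converted.join("\n")}\n`;
}

const exampleSrt =
  "1\n00:00:00,000 --> 00:00:02,500\nThat's one small step for man\n";
console.log(srtToVtt(exampleSrt));
```

In practice it is simpler to request the `elevenlabs:alignment:vtt` provider directly; a conversion like this is mostly useful when only SRT files are on hand.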

Configuration

Basic alignment (JSON output)

yaml
providers:
  - id: elevenlabs:alignment:json
    label: Alignment (JSON)

tests:
  - vars:
      audioFile: path/to/audio.mp3
      transcript: 'Your transcript text here'
      format: json

SRT subtitles

yaml
providers:
  - id: elevenlabs:alignment:srt
    label: Alignment (SRT Subtitles)

tests:
  - vars:
      audioFile: path/to/audio.mp3
      transcript: 'Your transcript text here'
      format: srt

VTT subtitles

yaml
providers:
  - id: elevenlabs:alignment:vtt
    label: Alignment (VTT Subtitles)

tests:
  - vars:
      audioFile: path/to/audio.mp3
      transcript: 'Your transcript text here'
      format: vtt

Testing Assertions

yaml
tests:
  # Verify alignment succeeds
  - assert:
      - type: javascript
        value: output.includes('words') # JSON format
      - type: not-contains
        value: error

  # Verify SRT format
  - assert:
      - type: javascript
        value: output.includes('-->') && output.includes('small step')
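
The `includes('-->')` check only confirms that a timing arrow is present. A stricter structural check could look like the sketch below. `checkSrt` is a hypothetical function; the `{ pass, score, reason }` return shape is meant to match what promptfoo's JavaScript assertions accept, so verify that against the assertion docs before wiring it in. Rejecting overlapping cues is a choice made here, not an SRT requirement.

```javascript
const TIMING =
  /^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$/;

// Convert captured HH, MM, SS, mmm strings to milliseconds.
function toMs(h, m, s, ms) {
  return ((Number(h) * 60 + Number(m)) * 60 + Number(s)) * 1000 + Number(ms);
}

// Check that every cue has a sequential index, a valid timing line,
// non-empty text, and timestamps that never move backwards.
function checkSrt(output) {
  const fail = (reason) => ({ pass: false, score: 0, reason });
  const blocks = output.replace(/\r\n/g, "\n").trim().split(/\n{2,}/);
  let previousEnd = 0;
  for (let i = 0; i < blocks.length; i++) {
    const [index, timing, ...text] = blocks[i].split("\n");
    if (Number(index) !== i + 1) {
      return fail(`cue ${i + 1}: unexpected index "${index}"`);
    }
    const match = TIMING.exec(timing ?? "");
    if (!match) return fail(`cue ${i + 1}: malformed timing line`);
    const start = toMs(...match.slice(1, 5));
    const end = toMs(...match.slice(5, 9));
    if (end < start || start < previousEnd) {
      return fail(`cue ${i + 1}: timestamps out of order`);
    }
    if (text.join("").trim() === "") return fail(`cue ${i + 1}: empty text`);
    previousEnd = end;
  }
  return { pass: true, score: 1, reason: `${blocks.length} well-formed cues` };
}

const goodSrt =
  "1\n00:00:00,000 --> 00:00:02,500\nThat's one small step for man\n\n" +
  "2\n00:00:02,500 --> 00:00:05,000\none giant leap for mankind\n";
console.log(checkSrt(goodSrt));
```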

Best Practices

  1. Transcript accuracy: Ensure the transcript exactly matches the spoken audio
  2. Include punctuation: Punctuation improves subtitle chunking and timing
  3. Audio quality: Clear audio produces more accurate timestamps
  4. Format selection:
    • Use SRT for video editing (Premiere, Final Cut, DaVinci)
    • Use VTT for web players (HTML5 <video> tag)
    • Use JSON for custom processing

Cost Information

Forced alignment pricing is based on audio duration:

  • ~$0.05 per minute of audio

The provider automatically tracks costs in evaluation results.
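
For budgeting before a run, the arithmetic is simply duration in minutes times the per-minute rate. The sketch below uses the approximate $0.05 figure quoted above as a default; treat that rate as an assumption and check current ElevenLabs pricing. `estimateAlignmentCost` is an illustrative helper, not part of the provider.

```javascript
// Approximate rate quoted in this README; actual pricing may differ.
const ASSUMED_RATE_PER_MINUTE = 0.05;

// Estimate cost in USD for a clip of the given duration in seconds.
function estimateAlignmentCost(durationSeconds, ratePerMinute = ASSUMED_RATE_PER_MINUTE) {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError("durationSeconds must be a non-negative number");
  }
  return (durationSeconds / 60) * ratePerMinute;
}

// A 90-second clip is 1.5 minutes of audio.
console.log(`$${estimateAlignmentCost(90).toFixed(3)}`);
```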

Resources