Whisper Timestamp Accuracy Benchmark

This directory contains tools for benchmarking the word-level timestamp accuracy of sherpa-onnx Whisper models against ground truth alignments from the Montreal Forced Aligner (MFA).

Overview

The benchmark suite evaluates how accurately sherpa-onnx predicts word-level timestamps by comparing against MFA alignments on LibriSpeech data. MFA provides high-quality forced alignments that serve as ground truth for measuring timestamp accuracy.
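
At its core, the benchmark compares two word-level timelines: one from MFA and one from sherpa-onnx. The sketch below illustrates the idea under a simplifying assumption (positional matching only); the function name and data layout are hypothetical, not the benchmark's actual API, and the real matcher must tolerate ASR insertions and deletions.

python
# Illustrative sketch only: the function name and data layout are
# hypothetical, not the benchmark's actual implementation.
def word_timing_errors(gt_words, pred_words):
    """Compare lists of (word, start_sec, end_sec) tuples; errors in ms.

    A real aligner must resynchronize after ASR insertions/deletions;
    this sketch only credits words that agree at the same position.
    """
    errors = []
    for (gt_w, gt_s, gt_e), (pr_w, pr_s, pr_e) in zip(gt_words, pred_words):
        if gt_w.lower() != pr_w.lower():
            continue  # unmatched word; skip rather than resynchronize
        errors.append({
            "word": gt_w,
            "start_error_ms": abs(pr_s - gt_s) * 1000.0,
            "end_error_ms": abs(pr_e - gt_e) * 1000.0,
        })
    return errors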

Scripts

download_librispeech_test_data.py

Downloads and prepares the benchmark dataset:

  • LibriSpeech dev-clean audio (converted to 16kHz mono WAV)
  • MFA word alignments with precise word boundaries

Usage:

bash
uv run python scripts/benchmark/download_librispeech_test_data.py [--num-utterances 200]

Options:

  • --num-utterances - Number of utterances to include (default: 200)
  • --output-dir - Output directory (default: benchmark_data)
  • --skip-download - Skip download step and use existing files

Output:

  • benchmark_data/audio/*.wav - Audio files
  • benchmark_data/manifest.json - Mapping of audio files to ground truth timestamps (see the loading sketch below)
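
The exact manifest schema is defined by the download script; the sketch below assumes a plausible layout (a list of utterance records with audio and words fields) and should be adjusted to the real structure.

python
# Hypothetical manifest reader; the assumed schema (a list of records with
# "audio" and "words" keys) may differ from what the download script writes.
import json
from pathlib import Path

with open(Path("benchmark_data") / "manifest.json") as f:
    manifest = json.load(f)

for utt in manifest:
    print(utt["audio"], "-", len(utt["words"]), "ground-truth words")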

Requirements:

  • gdown (for Google Drive downloads)
  • ffmpeg or sox (for audio conversion)

run_timestamp_benchmark.py

Runs the timestamp accuracy benchmark against the downloaded ground truth.

Usage:

bash
PYTHONPATH=build/lib:sherpa-onnx/python uv run python scripts/benchmark/run_timestamp_benchmark.py \
    --encoder ./whisper-tiny-attention/tiny-encoder.onnx \
    --decoder ./whisper-tiny-attention/tiny-decoder.onnx \
    --tokens ./whisper-tiny-attention/tiny-tokens.txt

Options:

  • --encoder - Path to Whisper encoder ONNX model (required)
  • --decoder - Path to Whisper decoder ONNX model (required)
  • --tokens - Path to tokens file (required)
  • --data-dir - Directory with manifest and audio (default: benchmark_data)
  • --output-dir - Output directory for results (default: benchmark_results)
  • --language - Language code (default: en)
  • --num-workers - Number of parallel workers (default: 1)

Parallel Processing:

bash
# Run with 4 workers for faster benchmarking
PYTHONPATH=build/lib:sherpa-onnx/python uv run python scripts/benchmark/run_timestamp_benchmark.py \
    --encoder ./whisper-tiny-attention/tiny-encoder.onnx \
    --decoder ./whisper-tiny-attention/tiny-decoder.onnx \
    --tokens ./whisper-tiny-attention/tiny-tokens.txt \
    --num-workers 4

Note: Each worker loads its own model copy, so memory usage scales linearly with worker count.
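
The per-worker loading pattern looks roughly like the sketch below. It is illustrative rather than the script's actual structure; OfflineRecognizer.from_whisper and the stream-decoding calls follow the sherpa-onnx Python API, but check the current docs for exact signatures, and soundfile is an assumed dependency for reading the WAV files.

python
# Sketch of per-worker model loading: each process builds its own
# recognizer in the Pool initializer, so model state is never pickled.
from multiprocessing import Pool

_recognizer = None  # one copy per worker process

def _init_worker(encoder, decoder, tokens):
    global _recognizer
    import sherpa_onnx
    _recognizer = sherpa_onnx.OfflineRecognizer.from_whisper(
        encoder=encoder, decoder=decoder, tokens=tokens, language="en")

def _transcribe(wav_path):
    import soundfile as sf  # assumed helper for reading 16 kHz mono WAVs
    samples, sample_rate = sf.read(wav_path, dtype="float32")
    stream = _recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    _recognizer.decode_stream(stream)
    return wav_path, stream.result.text

def transcribe_all(wav_paths, encoder, decoder, tokens, num_workers=4):
    with Pool(num_workers, _init_worker, (encoder, decoder, tokens)) as pool:
        return pool.map(_transcribe, wav_paths)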

Requirements:

  • numpy
  • jiwer (for WER calculation)
  • Built sherpa-onnx library

Note on PYTHONPATH: This script uses PYTHONPATH=build/lib:sherpa-onnx/python instead of pip install sherpa-onnx to allow rapid iteration when developing C++ code. After running make in the build directory, you can immediately test without reinstalling the package.

Output Format

details_YYYYMMDD_HHMMSS.csv

Per-word timing errors with columns (see the analysis sketch after this list):

  • utterance_id - Utterance identifier
  • word_index - Word position in utterance
  • word - The word text
  • gt_start, gt_end - Ground truth timestamps (seconds)
  • pred_start, pred_end - Predicted timestamps (seconds)
  • matched - Whether the word was successfully aligned
  • start_error_ms, end_error_ms - Timing errors in milliseconds
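
With only the columns above, corpus-level error statistics fall out of a few lines of pandas. This is a hypothetical analysis sketch; the file name is a placeholder for an actual timestamped output.

python
# Hypothetical post-hoc analysis; relies only on the documented columns.
import pandas as pd

df = pd.read_csv("benchmark_results/details_YYYYMMDD_HHMMSS.csv")  # placeholder name
matched = df[df["matched"]]  # keep successfully aligned words only

print("matched:", len(matched), "of", len(df), "words")
print("median start error (ms):", matched["start_error_ms"].median())
print("p95 start error (ms):", matched["start_error_ms"].quantile(0.95))
print("within 50 ms:", 100 * (matched["start_error_ms"] <= 50).mean(), "%")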

summary_YYYYMMDD_HHMMSS.csv

Per-utterance aggregate statistics (a roll-up sketch follows the list):

  • utterance_id - Utterance identifier
  • num_gt_words, num_pred_words, num_matched - Word counts
  • match_rate - Fraction of ground truth words matched
  • wer - Word Error Rate
  • mean_start_error_ms, median_start_error_ms, max_start_error_ms - Start time error statistics
  • mean_end_error_ms, median_end_error_ms, max_end_error_ms - End time error statistics
  • pct_within_20ms, pct_within_50ms - Percentage of words within accuracy thresholds
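
A corpus-level roll-up of the summary file follows the same pattern. Note that averaging per-utterance means weights every utterance equally; weight by num_matched if a per-word average is wanted. The file name is again a placeholder.

python
# Hypothetical roll-up of per-utterance statistics into corpus numbers.
import pandas as pd

s = pd.read_csv("benchmark_results/summary_YYYYMMDD_HHMMSS.csv")  # placeholder name
print("utterances:", len(s))
print("mean WER:", s["wer"].mean())
print("mean match rate:", s["match_rate"].mean())
# Per-word weighting instead of per-utterance weighting:
weighted = (s["mean_start_error_ms"] * s["num_matched"]).sum() / s["num_matched"].sum()
print("per-word mean start error (ms):", weighted)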

Metrics Explained

  • Start/End Time Error: Absolute difference between predicted and ground truth timestamps, in milliseconds
  • Match Rate: The fraction of ground truth words that were successfully aligned with predictions
  • WER (Word Error Rate): Standard ASR accuracy metric computed with jiwer; lower is better (see the example after this list)
  • Accuracy Thresholds: Percentage of words with start time error within 20 ms or 50 ms (the pct_within_* columns above)
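
Because jiwer is already a requirement, its WER behavior is easy to sanity-check in isolation; the sentences below are illustrative only.

python
# Minimal WER check with jiwer; the strings are illustrative.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(jiwer.wer(reference, hypothesis))  # 2 substitutions / 9 words ≈ 0.222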

Example Workflow

bash
# 1. Build sherpa-onnx
cd build && make -j8 && cd ..

# 2. Export a Whisper model with attention outputs
uv run python scripts/whisper/export-onnx.py --model tiny --with-attention --output-dir ./whisper-tiny-attention

# 3. Download benchmark data
uv run python scripts/benchmark/download_librispeech_test_data.py --num-utterances 200

# 4. Run the benchmark
PYTHONPATH=build/lib:sherpa-onnx/python uv run python scripts/benchmark/run_timestamp_benchmark.py \
    --encoder ./whisper-tiny-attention/tiny-encoder.onnx \
    --decoder ./whisper-tiny-attention/tiny-decoder.onnx \
    --tokens ./whisper-tiny-attention/tiny-tokens.txt \
    --num-workers 4

# 5. Review results in benchmark_results/

Data Sources and Citations

LibriSpeech Corpus

The audio data comes from the LibriSpeech ASR corpus:

Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5206-5210). IEEE. https://doi.org/10.1109/ICASSP.2015.7178964

LibriSpeech is derived from read audiobooks from the LibriVox project and is freely available under a CC BY 4.0 license.

Montreal Forced Aligner (MFA)

The ground truth word alignments were generated using the Montreal Forced Aligner:

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Proceedings of Interspeech 2017 (pp. 498-502). https://doi.org/10.21437/Interspeech.2017-1386

MFA is an open-source forced alignment tool that uses Kaldi for acoustic modeling.

Pre-computed LibriSpeech Alignments

The pre-computed MFA alignments for LibriSpeech are provided by the librispeech-alignments project by Corentin Jemine.

License

The LibriSpeech corpus is released under the CC BY 4.0 license. Please ensure compliance with all applicable licenses when using this benchmark data.