# Whisper Word Timestamp Benchmark
This directory contains tools for benchmarking sherpa-onnx Whisper word timestamp accuracy against ground truth alignments from the Montreal Forced Aligner (MFA).
The benchmark suite evaluates how accurately sherpa-onnx predicts word-level timestamps by comparing against MFA alignments on LibriSpeech data. MFA provides high-quality forced alignments that serve as ground truth for measuring timestamp accuracy.
## download_librispeech_test_data.py

Downloads and prepares the benchmark dataset.
Usage:
```bash
uv run python scripts/benchmark/download_librispeech_test_data.py [--num-utterances 200]
```
Options:
- `--num-utterances` - Number of utterances to include (default: 200)
- `--output-dir` - Output directory (default: `benchmark_data`)
- `--skip-download` - Skip the download step and use existing files

Output:
- `benchmark_data/audio/*.wav` - Audio files
- `benchmark_data/manifest.json` - Mapping of audio files to ground truth word timestamps (see the reading sketch below)
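The exact manifest schema is defined by the download script; as a rough illustration only, a consumer might look like the following minimal sketch, where the field names (`audio`, `words`, `start`, `end`) are assumptions rather than the script's confirmed output:

```python
# Minimal sketch of reading the manifest produced by the download script.
# NOTE: the field names below ("audio", "words", "word", "start", "end") are
# illustrative assumptions; inspect manifest.json for the actual schema.
import json

with open("benchmark_data/manifest.json") as f:
    manifest = json.load(f)

for entry in manifest:
    print(entry["audio"])                 # path to the wav file
    for w in entry["words"]:              # ground truth word alignments
        print(f'  {w["word"]}: {w["start"]:.3f}-{w["end"]:.3f} s')
```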
Requirements:

- `gdown` (for Google Drive downloads)
- `ffmpeg` or `sox` (for audio conversion)

## run_timestamp_benchmark.py

Runs the timestamp accuracy benchmark against the downloaded ground truth.
Usage:
```bash
PYTHONPATH=build/lib:sherpa-onnx/python uv run python scripts/benchmark/run_timestamp_benchmark.py \
  --encoder ./whisper-tiny-attention/tiny-encoder.onnx \
  --decoder ./whisper-tiny-attention/tiny-decoder.onnx \
  --tokens ./whisper-tiny-attention/tiny-tokens.txt
```
Options:
- `--encoder` - Path to Whisper encoder ONNX model (required)
- `--decoder` - Path to Whisper decoder ONNX model (required)
- `--tokens` - Path to tokens file (required)
- `--data-dir` - Directory with manifest and audio (default: `benchmark_data`)
- `--output-dir` - Output directory for results (default: `benchmark_results`)
- `--language` - Language code (default: `en`)
- `--num-workers` - Number of parallel workers (default: 1)

Parallel Processing:
```bash
# Run with 4 workers for faster benchmarking
PYTHONPATH=build/lib:sherpa-onnx/python uv run python scripts/benchmark/run_timestamp_benchmark.py \
  --encoder ./whisper-tiny-attention/tiny-encoder.onnx \
  --decoder ./whisper-tiny-attention/tiny-decoder.onnx \
  --tokens ./whisper-tiny-attention/tiny-tokens.txt \
  --num-workers 4
```
Note: Each worker loads its own model copy, so memory usage scales linearly with worker count.
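For context, the one-model-per-worker pattern the note describes typically looks like the following minimal sketch. It uses `multiprocessing` with a pool initializer and sherpa-onnx's Python API; treat the exact recognizer parameters and file paths as assumptions and consult the benchmark script for the real implementation:

```python
# Minimal sketch of one-model-per-worker parallel decoding. This illustrates
# the pattern described above; it is not the benchmark script's actual code.
import multiprocessing as mp
import wave

import numpy as np

_recognizer = None  # each worker process holds its own model copy


def init_worker(encoder, decoder, tokens):
    """Load a recognizer once per worker; this is why memory scales linearly."""
    global _recognizer
    import sherpa_onnx
    _recognizer = sherpa_onnx.OfflineRecognizer.from_whisper(
        encoder=encoder, decoder=decoder, tokens=tokens, language="en",
    )


def decode_file(wav_path):
    """Decode one 16-bit mono wav and return text plus token timestamps."""
    with wave.open(wav_path) as f:
        samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
        samples = samples.astype(np.float32) / 32768.0
        sample_rate = f.getframerate()
    stream = _recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    _recognizer.decode_stream(stream)
    result = stream.result
    return result.text, list(zip(result.tokens, result.timestamps))


if __name__ == "__main__":
    files = ["benchmark_data/audio/example.wav"]  # placeholder path
    with mp.Pool(4, initializer=init_worker,
                 initargs=("tiny-encoder.onnx", "tiny-decoder.onnx",
                           "tiny-tokens.txt")) as pool:
        for text, token_times in pool.map(decode_file, files):
            print(text)
```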
Requirements:
- `numpy`
- `jiwer` (for WER calculation)

Note on PYTHONPATH: This script uses `PYTHONPATH=build/lib:sherpa-onnx/python` instead of `pip install sherpa-onnx` to allow rapid iteration when developing C++ code. After running `make` in the build directory, you can immediately test without reinstalling the package.
### details_YYYYMMDD_HHMMSS.csv

Per-word timing errors with columns:
- `utterance_id` - Utterance identifier
- `word_index` - Word position in the utterance
- `word` - The word text
- `gt_start`, `gt_end` - Ground truth timestamps (seconds)
- `pred_start`, `pred_end` - Predicted timestamps (seconds)
- `matched` - Whether the word was successfully aligned (see the matching sketch after this list)
- `start_error_ms`, `end_error_ms` - Timing errors in milliseconds
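How predicted words are matched to ground truth words is up to the benchmark script; as one plausible illustration, the sketch below aligns the two word sequences with Python's `difflib.SequenceMatcher` and derives per-word timing errors. The alignment technique and the use of absolute errors are assumptions, not the script's confirmed algorithm:

```python
# Illustrative word matching between ground truth and predicted words using
# difflib.SequenceMatcher. The benchmark script may use a different alignment
# algorithm; this only sketches how matched / start_error_ms / end_error_ms
# style values could be derived. Errors are absolute values (an assumption).
from difflib import SequenceMatcher


def normalize(word):
    return word.lower().strip(".,!?\"'")


def match_words(gt_words, pred_words):
    """gt_words and pred_words are lists of (word, start_sec, end_sec)."""
    sm = SequenceMatcher(
        a=[normalize(w) for w, _, _ in gt_words],
        b=[normalize(w) for w, _, _ in pred_words],
    )
    rows = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            continue  # unmatched words would get matched=False in the CSV
        for i, j in zip(range(i1, i2), range(j1, j2)):
            word, gt_start, gt_end = gt_words[i]
            _, pred_start, pred_end = pred_words[j]
            rows.append({
                "word": word,
                "matched": True,
                "start_error_ms": abs(pred_start - gt_start) * 1000.0,
                "end_error_ms": abs(pred_end - gt_end) * 1000.0,
            })
    return rows
```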
### summary_YYYYMMDD_HHMMSS.csv

Per-utterance aggregate statistics:

- `utterance_id` - Utterance identifier
- `num_gt_words`, `num_pred_words`, `num_matched` - Word counts
- `match_rate` - Fraction of ground truth words matched
- `wer` - Word Error Rate
- `mean_start_error_ms`, `median_start_error_ms`, `max_start_error_ms` - Start time error statistics
- `mean_end_error_ms`, `median_end_error_ms`, `max_end_error_ms` - End time error statistics
- `pct_within_20ms`, `pct_within_50ms` - Percentage of words within accuracy thresholds (see the aggregation sketch after this list)
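As a companion to the matching sketch above, here is one plausible way to aggregate matched-word errors into these per-utterance statistics; again this is an illustration under assumed row shapes, not the script's code:

```python
# Illustrative aggregation of per-word errors into per-utterance statistics,
# using the row shape from the match_words() sketch above. For brevity this
# aggregates start errors only; the real script also reports end-time stats
# and WER. Assumes at least one matched word.
import numpy as np


def summarize(rows, num_gt_words):
    start_errors = np.array([r["start_error_ms"] for r in rows])
    return {
        "num_matched": len(rows),
        "match_rate": len(rows) / num_gt_words,
        "mean_start_error_ms": float(start_errors.mean()),
        "median_start_error_ms": float(np.median(start_errors)),
        "max_start_error_ms": float(start_errors.max()),
        "pct_within_20ms": float((start_errors <= 20).mean() * 100.0),
        "pct_within_50ms": float((start_errors <= 50).mean() * 100.0),
    }
```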
## Quick Start

```bash
# 1. Build sherpa-onnx
cd build && make -j8 && cd ..

# 2. Export a Whisper model with attention outputs
uv run python scripts/whisper/export-onnx.py --model tiny --with-attention --output-dir ./whisper-tiny-attention

# 3. Download benchmark data
uv run python scripts/benchmark/download_librispeech_test_data.py --num-utterances 200

# 4. Run the benchmark
PYTHONPATH=build/lib:sherpa-onnx/python uv run python scripts/benchmark/run_timestamp_benchmark.py \
  --encoder ./whisper-tiny-attention/tiny-encoder.onnx \
  --decoder ./whisper-tiny-attention/tiny-decoder.onnx \
  --tokens ./whisper-tiny-attention/tiny-tokens.txt \
  --num-workers 4

# 5. Review results in benchmark_results/
```
## Attribution

The audio data comes from the LibriSpeech ASR corpus:
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5206-5210). IEEE. https://doi.org/10.1109/ICASSP.2015.7178964
LibriSpeech is derived from read audiobooks from the LibriVox project and is freely available under a CC BY 4.0 license.
The ground truth word alignments were generated using the Montreal Forced Aligner:
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Proceedings of Interspeech 2017 (pp. 498-502). https://doi.org/10.21437/Interspeech.2017-1386
MFA is an open-source forced alignment tool that uses Kaldi for acoustic modeling.
The pre-computed MFA alignments for LibriSpeech are provided by the librispeech-alignments project by Corentin Jemine.
The LibriSpeech corpus is released under the CC BY 4.0 license. Please ensure compliance with all applicable licenses when using this benchmark data.