scripts/whisper/README.md
This folder contains code showing how to convert Whisper to onnx and use onnxruntime to replace PyTorch for speech recognition.
You can use sherpa-onnx to run the converted model.
Please see https://k2-fsa.github.io/sherpa/onnx/pretrained_models/whisper/export-onnx.html for details.
The export-onnx-with-attention.py script exports Whisper models with
cross-attention weights for word-level timestamps. It requires knowing which
attention heads are "alignment heads" - heads that show monotonically increasing
attention patterns useful for aligning audio to text.
For standard OpenAI Whisper models, alignment heads are defined in the
ALIGNMENT_HEADS dict in the export script. For new or custom models (like
distil-whisper variants), you can discover alignment heads using:
python find_alignment_heads.py --model <model-name> --audio <test-audio.wav>
This script analyzes all attention heads and ranks them by:
Example output:
Top 15 alignment head candidates:
------------------------------------------------------------
Layer Head Monotonic Diagonal Combined
------------------------------------------------------------
3 2 0.846 0.985 0.915
0 0 0.962 0.617 0.789
...
Heads with high combined scores (>0.7) are good candidates. A single head with a very high diagonal score (>0.9) is often sufficient for accurate timestamps.