# ASR Benchmark

This benchmark evaluates the performance and accuracy (Word Error Rate - WER) of Automatic Speech Recognition (ASR) models served via SGLang.

## Supported Models

- `openai/whisper-large-v3`
- `openai/whisper-large-v3-turbo`
- `Qwen/Qwen3-ASR-1.7B`
- `Qwen/Qwen3-ASR-0.6B`

## Setup

Install the required dependencies:

```bash
apt install ffmpeg
pip install librosa soundfile datasets evaluate jiwer transformers openai torchcodec torch
```

## Running the Benchmark

### 1. Start SGLang Server

Launch the SGLang server with a Whisper model:

```bash
python -m sglang.launch_server --model-path openai/whisper-large-v3 --port 30000
```
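For a quick sanity check that the server is up, you can list the served models through the OpenAI-compatible API. A minimal sketch, assuming the standard `/v1/models` endpoint (the `EMPTY` API key is just a placeholder commonly used with local servers):

```python
from openai import OpenAI

# Point the OpenAI client at the local SGLang server; local
# OpenAI-compatible servers typically accept any placeholder key.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Should print openai/whisper-large-v3 if the launch above succeeded.
for model in client.models.list():
    print(model.id)
```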

### 2. Run the Benchmark Script

Basic usage (using the chat completions API):

```bash
python bench_sglang.py --base-url http://localhost:30000 --model openai/whisper-large-v3 --n-examples 10
```
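The chat path sends audio inline as part of a chat message. A minimal sketch of what such a request may look like, assuming the server accepts OpenAI-style `input_audio` content parts; the file name is illustrative, and this is not necessarily how `bench_sglang.py` builds its requests:

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# The input_audio content part expects base64-encoded audio bytes.
with open("sample.wav", "rb") as f:  # illustrative file name
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/whisper-large-v3",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                }
            ],
        }
    ],
)
print(response.choices[0].message.content)
```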

Using the OpenAI-compatible transcription API:

```bash
python bench_sglang.py \
    --base-url http://localhost:30000 \
    --model openai/whisper-large-v3 \
    --api-type transcription \
    --language English \
    --n-examples 10
```
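This path corresponds to the standard OpenAI audio transcriptions endpoint. A sketch of the equivalent single request with the `openai` client (the file name is illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Upload one audio file to the OpenAI-compatible
# /v1/audio/transcriptions endpoint.
with open("sample.wav", "rb") as f:  # illustrative file name
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=f,
        language="en",  # full names or ISO 639-1 codes; see Notes
    )
print(transcript.text)
```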

Run with streaming and show real-time output:

```bash
python bench_sglang.py \
    --base-url http://localhost:30000 \
    --model openai/whisper-large-v3 \
    --api-type transcription \
    --stream \
    --show-predictions \
    --concurrency 1
```
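In streaming mode, partial transcript text arrives incrementally. A minimal sketch of consuming such a stream, assuming the server emits the OpenAI SDK's `transcript.text.delta` events (the file name is again illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:  # illustrative file name
    stream = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=f,
        stream=True,  # request incremental transcript events
    )
    # Print text deltas as they arrive; other event types
    # (such as the final transcript) are ignored here.
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
print()
```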

Run with higher concurrency and save results:

```bash
python bench_sglang.py \
    --base-url http://localhost:30000 \
    --model openai/whisper-large-v3 \
    --concurrency 8 \
    --n-examples 100 \
    --output results.json \
    --show-predictions
```

## Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--base-url` | SGLang server URL | `http://localhost:30000` |
| `--model` | Model name on the server | `openai/whisper-large-v3` |
| `--dataset` | HuggingFace dataset for evaluation | `D4nt3/esb-datasets-earnings22-validation-tiny-filtered` |
| `--split` | Dataset split to use | `validation` |
| `--concurrency` | Number of concurrent requests | `4` |
| `--n-examples` | Number of examples to process (`-1` for all) | `-1` |
| `--output` | Path to save results as JSON | `None` |
| `--show-predictions` | Display sample predictions | `False` |
| `--print-n` | Number of samples to display | `5` |
| `--api-type` | API to use: `chat` (chat completions) or `transcription` (audio transcriptions) | `chat` |
| `--language` | Language for the transcription API (e.g., `English`, `en`) | `None` |
| `--stream` | Enable streaming mode for the transcription API | `False` |

## Metrics

The benchmark outputs:

| Metric | Description |
|--------|-------------|
| Total Requests | Number of successful ASR requests processed |
| WER | Word Error Rate (lower is better), computed using the `evaluate` library |
| Average Latency | Mean time per request (seconds) |
| Median Latency | 50th percentile latency (seconds) |
| 95th Latency | 95th percentile latency (seconds) |
| Throughput | Requests processed per second |
| Token Throughput | Output tokens per second |
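For reference, WER follows the standard `evaluate`/`jiwer` definition (word-level substitutions, deletions, and insertions divided by the number of reference words), and the latency percentiles are plain order statistics. A self-contained sketch with illustrative values:

```python
import statistics

import evaluate

# WER via the evaluate library (which wraps jiwer).
wer_metric = evaluate.load("wer")
wer = wer_metric.compute(
    predictions=["we talked about 4.7 gigawatts"],
    references=["we talked about 4.7 gigawatts"],
)
print(f"WER: {wer:.4f}")  # 0.0000 for an exact match

# Latency statistics over per-request timings (illustrative values).
latencies = [1.21, 1.05, 1.36, 2.99, 1.18]
print(f"Average Latency: {statistics.mean(latencies):.4f}s")
print(f"Median Latency:  {statistics.median(latencies):.4f}s")
p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile
print(f"95th Latency:    {p95:.4f}s")
```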

## Example Output

```bash
python bench_sglang.py --api-type transcription --concurrency 128 --model openai/whisper-large-v3 --show-predictions
```

```
Loading dataset: D4nt3/esb-datasets-earnings22-validation-tiny-filtered...
Using API type: transcription
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Performing warmup...
Processing 511 samples...
------------------------------
Results for openai/whisper-large-v3:
Total Requests: 511
WER: 12.7690
Average Latency: 1.3602s
Median Latency: 1.2090s
95th Latency: 2.9986s
Throughput: 19.02 req/s
Token Throughput: 354.19 tok/s
Total Test Time: 26.8726s
------------------------------

==================== Sample Predictions ====================
Sample 1:
  REF: on the use of taxonomy i you know i think it is it is early days for us to to make any clear indications to the market about the proportion that would fall under that requirement
  PRED: on the eu taxonomy i think it is early days for us to make any clear indications to the market about the proportion that would fall under that requirement
----------------------------------------
Sample 2:
  REF: so within fiscal year 2021 say 120 a 100 depending on what the micro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
  PRED: so within fiscal year 2021 say $120000 $100000 depending on what the macro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
----------------------------------------
Sample 3:
  REF: we talked about 4.7 gigawatts
  PRED: we talked about 4.7 gigawatts
----------------------------------------
Sample 4:
  REF: and you know depending on that working capital build we will we will see what that yields
  PRED: and depending on that working capital build we will see what that yields what
----------------------------------------
Sample 5:
  REF: so on on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30 70%
  PRED: so on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexes are distributed out 30% 70%
----------------------------------------
============================================================
```

## Notes

- Audio samples longer than 30 seconds are automatically filtered out (a Whisper limitation); a filtering sketch follows this list
- The benchmark performs a warmup request before measuring performance
- Results are normalized using the model's tokenizer when available
- When using `--stream` with `--show-predictions`, use `--concurrency 1` for clean sequential output
- The `--language` option accepts both full names (e.g., `English`) and ISO 639-1 codes (e.g., `en`)
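A sketch of the kind of duration filter described in the first note, using the `datasets` library; the `audio` column layout follows common HF conventions and may differ from what `bench_sglang.py` actually does (newer `datasets` releases can also expose decoded audio differently):

```python
from datasets import Audio, load_dataset

# Load the benchmark's default evaluation dataset.
ds = load_dataset(
    "D4nt3/esb-datasets-earnings22-validation-tiny-filtered",
    split="validation",
)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

def within_whisper_limit(example, max_seconds=30.0):
    """Keep only clips Whisper can transcribe in a single pass."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] <= max_seconds

ds = ds.filter(within_whisper_limit)
print(f"{len(ds)} samples within the 30-second limit")
```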

## Troubleshooting

### Server connection refused

- Ensure the SGLang server is running and accessible at the specified `--base-url`
- Check that the port is not blocked by a firewall

### Out of memory errors

- Reduce `--concurrency` to lower GPU memory usage
- Use a smaller Whisper model variant