docs/source/getting_started/speed_benchmark.md
We report the speed performance of bfloat16 and quantized (FP8, GPTQ, AWQ) models in the Qwen3 series. Specifically, we report the inference speed (tokens/s) and the memory footprint (GB) under different context lengths.
Inference Speed (tokens/s) is calculated as:
$$
\text{Speed} = \frac{\text{tokens}_{\text{prompt}} + \text{tokens}_{\text{generation}}}{\text{time}}
$$
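For example, with an input of 6144 tokens, 2048 generated tokens, and a hypothetical total inference time of 20 seconds (the time value here is illustrative, not a measured result):

$$
\text{Speed} = \frac{6144 + 2048}{20\,\text{s}} = 409.6\ \text{tokens/s}
$$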
We use a batch size of 1 and the minimum number of GPUs possible for evaluation.
We test the speed and memory usage when generating 2048 tokens, with input lengths of
1, 6144, 14336, 30720, 63488, and 129024 tokens.
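For reference, below is a minimal sketch of how one such measurement could be taken with Transformers. This is not the official benchmark script; the model name is an example, random token IDs stand in for a real prompt (sufficient for timing), and the 6144-token input is one of the lengths listed above.

```python
# Minimal speed/memory measurement sketch (not the official benchmark harness).
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # example checkpoint; substitute the model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a batch-size-1 input of the target context length (here 6144 tokens).
# Random token IDs are fine for speed measurement purposes.
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 6144)).to(model.device)
attention_mask = torch.ones_like(input_ids)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=2048,
    min_new_tokens=2048,  # force the full 2048-token generation, ignoring early EOS
    do_sample=False,
)
elapsed = time.perf_counter() - start

# output.shape[1] is prompt tokens + generated tokens, matching the formula above.
speed = output.shape[1] / elapsed
memory_gb = torch.cuda.max_memory_allocated() / 1024**3  # peak footprint in GB
print(f"{speed:.1f} tokens/s, {memory_gb:.2f} GB")
```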
For SGLang, we use the following settings (a configuration sketch follows this list):

- `mem_fraction_static=0.85`.
- `context_length=140000`, with `enable_mixed_chunk=True` enabled.
- `skip_tokenizer_init=True`, performing generation from `input_ids` instead of raw text prompts.
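The following sketch shows how these settings could be passed to SGLang's offline `Engine` API. It is an assumption about how the benchmark was wired up, not the exact script; the model path and the token IDs are examples.

```python
# Sketch: an offline SGLang engine configured with the settings listed above.
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen3-8B",   # example model; substitute the checkpoint under test
    mem_fraction_static=0.85,     # fraction of GPU memory reserved statically
    context_length=140000,        # extended context window
    enable_mixed_chunk=True,      # mixed chunked prefill
    skip_tokenizer_init=True,     # engine consumes token IDs directly
)

# With skip_tokenizer_init=True, generation takes pre-tokenized input IDs
# rather than raw text prompts.
input_ids = [[151644, 872, 198]]  # example token IDs, not a real benchmark prompt
outputs = llm.generate(
    input_ids=input_ids,
    sampling_params={"max_new_tokens": 2048},
)

llm.shutdown()
```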
Notes:

- FP8 performance in Transformers: the inference speed of Transformers in FP8 mode is currently not optimal and requires further optimization.
- GPTQ-INT4 performance in SGLang: the performance of GPTQ-INT4 in SGLang also needs improvement; we are actively working with the SGLang team to enhance it.