examples/speed-benchmark/README.md
This document introduces the speed benchmark testing process for the Qwen2.5 series models (both original and quantized). For detailed reports, please refer to the Qwen2.5 Speed Benchmark.
- For models hosted on HuggingFace, refer to Qwen2.5 Collections-HuggingFace.
- For models hosted on ModelScope, refer to Qwen2.5 Collections-ModelScope.
For inference using HuggingFace transformers:
```bash
conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers

pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/[email protected]
pip install git+https://github.com/Dao-AILab/[email protected]
pip install -r requirements-perf-transformers.txt
```
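After installation, a quick sanity check of the environment (a minimal sketch; it only verifies the versions assumed by the commands above):

```python
import torch

print(torch.__version__)          # expect 2.3.1
print(torch.cuda.is_available())  # CUDA must be usable for the benchmark

# flash-attn is optional (see the note below); confirm it imports if installed.
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; the sdpa backend still works")
```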
> [!IMPORTANT]
> - For `flash-attention`, you can use the prebuilt wheels from GitHub Releases or install from source, which requires a compatible CUDA compiler.
> - You don't actually need to install `flash-attention`; it has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install from source, because the prebuilt wheels require incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be automatically installed. If not, run `pip install autoawq-kernels`.
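Concretely, the attention backend is selected when loading the model with transformers; a minimal sketch using the model ID from the commands below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "sdpa" uses PyTorch's built-in scaled_dot_product_attention, so flash-attn
# does not have to be installed; "flash_attention_2" and "eager" also work.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="sdpa",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```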
For inference using vLLM:
```bash
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm

pip install -r requirements-perf-vllm.txt
```
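Likewise, a quick import check can catch installation problems early (a minimal sketch):

```python
import torch
import vllm

print("vllm", vllm.__version__, "| CUDA available:", torch.cuda.is_available())
```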
Below are two methods for executing tests: using a script or the Speed Benchmark tool.
Use the Speed Benchmark tool developed by EvalScope, which supports automatic model downloads from ModelScope and outputs test results. It also allows testing by specifying the model service URL. For details, please refer to the 📖 User Guide.
Install Dependencies:

```bash
pip install 'evalscope[perf]' -U
```
To test HuggingFace transformers inference, execute:
```bash
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark
```
To test vLLM inference, execute:

```bash
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --log-every-n-query 1 \
 --connect-timeout 60000 \
 --read-timeout 60000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local_vllm \
 --dataset speed_benchmark
```
Parameter Explanation:

- `--parallel`: Number of worker threads for concurrent requests; should be fixed at 1.
- `--model`: Model file path or model ID; supports automatic downloads from ModelScope, e.g., `Qwen/Qwen2.5-0.5B-Instruct`.
- `--attn-implementation`: Attention implementation; optional values are `flash_attention_2|eager|sdpa`.
- `--log-every-n-query`: Log every n requests.
- `--connect-timeout`: Connection timeout in seconds.
- `--read-timeout`: Read timeout in seconds.
- `--max-tokens`: Maximum output length in tokens.
- `--min-tokens`: Minimum output length in tokens; setting both parameters to 2048 means the model outputs a fixed length of 2048 tokens.
- `--api`: Inference interface; local inference options are `local|local_vllm`.
- `--dataset`: Test dataset; options are `speed_benchmark|speed_benchmark_long`.

Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
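To inspect a result file programmatically, something like the following works; a hedged sketch, since the exact JSON schema is defined by EvalScope and may change:

```python
import json
from pathlib import Path

# Pick the lexicographically last result file under outputs/ (assumes at least
# one run has completed); directory names embed a timestamp, so for a given
# model this is typically the newest run.
result = sorted(Path("outputs").rglob("speed_benchmark.json"))[-1]
print(f"== {result} ==")
print(json.dumps(json.loads(result.read_text()), indent=2, ensure_ascii=False))
```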
Alternatively, run the test scripts directly. For inference using HuggingFace transformers:

Using the HuggingFace hub:

```bash
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

Using the ModelScope hub:

```bash
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```
Parameter Explanation:
- `--model_id_or_path`: Model ID or local path; for optional values, refer to the `Model Resources` section.
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g., `0,1,2,3`, `4,5`.
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace.
- `--outputs_dir`: Output directory; default is `outputs/transformers`.
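For orientation, this is roughly what the script measures: generate a fixed number of tokens and report tokens per second. A minimal sketch, not the script itself; the prompt is a placeholder:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
generate_length = 2048  # mirrors the script's default --generate_length

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(
    **inputs,
    max_new_tokens=generate_length,
    min_new_tokens=generate_length,  # force a fixed output length
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```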
For inference using vLLM:

Using the HuggingFace hub:

```bash
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

Using the ModelScope hub:

```bash
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```
Parameter Explanation:
- `--model_id_or_path`: Model ID or local path; for optional values, refer to the `Model Resources` section.
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--max_model_len`: Maximum model length in tokens; default is 32768.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g., `0,1,2,3`, `4,5`.
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace.
- `--gpu_memory_utilization`: GPU memory utilization, range (0, 1]; default is 0.9.
- `--outputs_dir`: Output directory; default is `outputs/vllm`.
- `--enforce_eager`: Whether to enforce eager mode; default is False.
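For reference, these flags map onto vLLM's offline API roughly as follows; a sketch under the defaults above, not the actual script:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # --model_id_or_path
    max_model_len=32768,                 # --max_model_len
    gpu_memory_utilization=0.9,          # --gpu_memory_utilization
    enforce_eager=False,                 # --enforce_eager
)
params = SamplingParams(
    max_tokens=2048,  # --generate_length
    ignore_eos=True,  # keep generating until the fixed length is reached
)
print(llm.generate(["Hello"], params)[0].outputs[0].text[:80])
```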
Test results can be found in the `outputs` directory, which by default includes two folders, `transformers` and `vllm`, storing the test results for HuggingFace transformers and vLLM respectively.