examples/speed-benchmark/README.md
This document introduces the speed benchmark testing process for the Qwen2.5 series models (both original and quantized). For detailed reports, please refer to the Qwen2.5 Speed Benchmark.
- For models hosted on HuggingFace, refer to Qwen2.5 Collections-HuggingFace.
- For models hosted on ModelScope, refer to Qwen2.5 Collections-ModelScope.
For inference using HuggingFace transformers:
```bash
conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers

pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/[email protected]
pip install git+https://github.com/Dao-AILab/[email protected]
pip install -r requirements-perf-transformers.txt
```
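After installation, a quick sanity check of the environment (a minimal sketch; it only verifies the versions assumed by the commands above):

```python
import torch

print(torch.__version__)          # expect 2.3.1
print(torch.cuda.is_available())  # CUDA must be usable for the benchmark

# flash-attn is optional (see the note below); confirm it imports if installed.
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; the sdpa backend still works")
```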
> [!IMPORTANT]
> - For `flash-attention`, you can use the prebuilt wheels from GitHub Releases or install from source, which requires a compatible CUDA compiler.
> - You don't actually need to install `flash-attention`; it has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install from source, because the prebuilt wheels require incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be automatically installed. If not, run `pip install autoawq-kernels`.
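Concretely, the attention backend is selected when loading the model with transformers; a minimal sketch using the model ID from the commands below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "sdpa" uses PyTorch's built-in scaled_dot_product_attention, so flash-attn
# does not have to be installed; "flash_attention_2" and "eager" also work.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="sdpa",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```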
For inference using vLLM:
```bash
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm

pip install -r requirements-perf-vllm.txt
```
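Likewise, a quick import check can catch installation problems early (a minimal sketch):

```python
import torch
import vllm

print("vllm", vllm.__version__, "| CUDA available:", torch.cuda.is_available())
```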
Below are two methods for executing tests: using a script or the Speed Benchmark tool.
Use the Speed Benchmark tool developed by EvalScope, which supports automatic model downloads from ModelScope and outputs test results. It also allows testing by specifying the model service URL. For details, please refer to the 📖 User Guide.
Install Dependencies:

```bash
pip install 'evalscope[perf]' -U
```
To test HuggingFace transformers inference, execute:
```bash
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark
```
To test vLLM inference, execute:

```bash
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --log-every-n-query 1 \
 --connect-timeout 60000 \
 --read-timeout 60000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local_vllm \
 --dataset speed_benchmark
```
Parameter Explanation:

- `--parallel`: Number of worker threads for concurrent requests; should be fixed at 1.
- `--model`: Model file path or model ID; supports automatic downloads from ModelScope, e.g., `Qwen/Qwen2.5-0.5B-Instruct`.
- `--attn-implementation`: Attention implementation; optional values are `flash_attention_2|eager|sdpa`.
- `--log-every-n-query`: Log every n requests.
- `--connect-timeout`: Connection timeout in seconds.
- `--read-timeout`: Read timeout in seconds.
- `--max-tokens`: Maximum output length in tokens.
- `--min-tokens`: Minimum output length in tokens; setting both parameters to 2048 means the model outputs a fixed length of 2048 tokens.
- `--api`: Inference interface; local inference options are `local|local_vllm`.
- `--dataset`: Test dataset; options are `speed_benchmark|speed_benchmark_long`.

Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
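To inspect a result file programmatically, something like the following works; a hedged sketch, since the exact JSON schema is defined by EvalScope and may change:

```python
import json
from pathlib import Path

# Pick the lexicographically last result file under outputs/ (assumes at least
# one run has completed); directory names embed a timestamp, so for a given
# model this is typically the newest run.
result = sorted(Path("outputs").rglob("speed_benchmark.json"))[-1]
print(f"== {result} ==")
print(json.dumps(json.loads(result.read_text()), indent=2, ensure_ascii=False))
```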
Alternatively, run the test scripts directly. For inference using HuggingFace transformers:

Using the HuggingFace hub:

```bash
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

Using the ModelScope hub:

```bash
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```
Parameter Explanation:
- `--model_id_or_path`: Model ID or local path; for optional values, refer to the `Model Resources` section.
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g., `0,1,2,3`, `4,5`.
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace.
- `--outputs_dir`: Output directory; default is `outputs/transformers`.
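For orientation, this is roughly what the script measures: generate a fixed number of tokens and report tokens per second. A minimal sketch, not the script itself; the prompt is a placeholder:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
generate_length = 2048  # mirrors the script's default --generate_length

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(
    **inputs,
    max_new_tokens=generate_length,
    min_new_tokens=generate_length,  # force a fixed output length
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```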
For inference using vLLM:

Using the HuggingFace hub:

```bash
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

Using the ModelScope hub:

```bash
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```
Parameter Explanation:
- `--model_id_or_path`: Model ID or local path; for optional values, refer to the `Model Resources` section.
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--max_model_len`: Maximum model length in tokens; default is 32768.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g., `0,1,2,3`, `4,5`.
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace.
- `--gpu_memory_utilization`: GPU memory utilization, range (0, 1]; default is 0.9.
- `--outputs_dir`: Output directory; default is `outputs/vllm`.
- `--enforce_eager`: Whether to enforce eager mode; default is False.
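For reference, these flags map onto vLLM's offline API roughly as follows; a sketch under the defaults above, not the actual script:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # --model_id_or_path
    max_model_len=32768,                 # --max_model_len
    gpu_memory_utilization=0.9,          # --gpu_memory_utilization
    enforce_eager=False,                 # --enforce_eager
)
params = SamplingParams(
    max_tokens=2048,  # --generate_length
    ignore_eos=True,  # keep generating until the fixed length is reached
)
print(llm.generate(["Hello"], params)[0].outputs[0].text[:80])
```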
Test results can be found in the `outputs` directory, which by default includes two folders, `transformers` and `vllm`, storing the test results for HuggingFace transformers and vLLM respectively.