3rdparty/amd/tuning/TUNING.md
This AppNote describes the SGLang performance tuning techniques, code harness, and running steps for systems with AMD Instinct GPUs. Harness code, examples, and steps are provided in detail to make it easy to reproduce the results and tune performance for your own workloads. Three primary runtime areas are covered:

1. Triton kernels
2. PyTorch TunableOp
3. PyTorch compilation (torch.compile with Inductor)
To maximize Triton kernel efficiency, several strategies can be employed. The key launch parameters to sweep are `waves_per_eu` (the desired number of waves per execution unit, which trades occupancy against register pressure), `num_warps`, and `num_stages`; in addition, for GEMM-like kernels on AMD GPUs, setting `matrix_instr_nonkdim` to 16 can enhance performance by eliminating the `convert_layout` operation in the kernel's epilogue. A sweep over these parameters is expressed with `@triton.autotune`:

```python
import triton

# Sweep waves_per_eu and num_warps; the fastest config is cached per key.
@triton.autotune(configs=[
    triton.Config({'waves_per_eu': 1}, num_warps=4, num_stages=1),
    triton.Config({'waves_per_eu': 1}, num_warps=8, num_stages=1),
    triton.Config({'waves_per_eu': 1}, num_warps=16, num_stages=1),
    triton.Config({'waves_per_eu': 2}, num_warps=4, num_stages=1),
    triton.Config({'waves_per_eu': 2}, num_warps=8, num_stages=1),
    triton.Config({'waves_per_eu': 2}, num_warps=16, num_stages=1),
    triton.Config({'waves_per_eu': 4}, num_warps=4, num_stages=1),
    triton.Config({'waves_per_eu': 4}, num_warps=8, num_stages=1),
    triton.Config({'waves_per_eu': 4}, num_warps=16, num_stages=1),
], key=['BLOCK_N', 'NUM_TOKEN_BLKS'], use_cuda_graph=True)
@triton.jit
def _triton_kernel_function():
    ...
```
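The AMD Triton backend also accepts `matrix_instr_nonkdim` (and `kpack`) as config keys. Below is a minimal GEMM-oriented sketch of such a sweep; the kernel name, the `BLOCK_M`/`BLOCK_N`/`BLOCK_K` tile sizes, and the autotune key are illustrative assumptions, not SGLang's actual kernel:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        # matrix_instr_nonkdim=16 removes the convert_layout in the epilogue
        # on AMD matrix cores; waves_per_eu trades occupancy for registers.
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64,
                       'waves_per_eu': 2, 'matrix_instr_nonkdim': 16},
                      num_warps=8, num_stages=1),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64, 'BLOCK_K': 64,
                       'waves_per_eu': 1, 'matrix_instr_nonkdim': 16},
                      num_warps=8, num_stages=1),
    ],
    key=['M', 'N', 'K'],  # re-tune whenever the GEMM shape changes
)
@triton.jit
def _gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                 BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                 BLOCK_K: tl.constexpr):
    ...  # kernel body elided; only the config sweep is of interest here
```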
TunableOp is a feature in PyTorch that allows for the definition and optimization of custom kernels with tunable parameters. This feature is particularly useful for enhancing the performance of kernels by experimenting with different configurations.
TunableOp is controlled through environment variables:

- `PYTORCH_TUNABLEOP_ENABLED`: Default is 0. Set to 1 to enable TunableOp.
- `PYTORCH_TUNABLEOP_TUNING`: Default is 1. Set to 0 to disable tuning. When `PYTORCH_TUNABLEOP_ENABLED` is enabled and a tuned entry is not found, the tuning step runs and records the entry.
- `PYTORCH_TUNABLEOP_VERBOSE`: Default is 0. Set to 1 to enable verbose output for TunableOp.

To enable TunableOp and tuning, and optionally enable verbose mode, run the following commands in your terminal:
```bash
# Tuning
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=1 your_script.sh
# Inference with the tuned ops
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 your_script.sh
# Print out the verbose log
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_VERBOSE=1 your_script.sh
```
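Recent PyTorch releases also expose TunableOp programmatically via the `torch.cuda.tunable` module, so the same workflow can be driven from Python. A minimal sketch, where the matrix shapes and the results filename are illustrative:

```python
import torch
import torch.cuda.tunable as tunable

# Programmatic equivalents of the environment variables above.
tunable.enable(True)         # PYTORCH_TUNABLEOP_ENABLED=1
tunable.tuning_enable(True)  # PYTORCH_TUNABLEOP_TUNING=1
tunable.set_filename("tunableop_results.csv")  # illustrative filename

# Any GEMM executed now is tuned the first time its shape is seen.
a = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
c = a @ b

tunable.write_file()  # persist tuned entries for later inference runs
```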
The following are suggestions for optimizing matrix multiplication (GEMM) and convolution (conv) operations in PyTorch using Inductor, part of the PyTorch compilation framework. The goal is to leverage Triton to achieve better performance.

To tune Triton kernels for GEMM and convolution ops, use `torch.compile` with the max-autotune mode. This benchmarks a predefined list of Triton configurations and selects the fastest one for each shape. The relevant settings are:
- **Max Autotune**: Set `torch._inductor.config.max_autotune = True` or `TORCHINDUCTOR_MAX_AUTOTUNE=1`.
- **Fine-Grained Control**: Set `torch._inductor.config.max_autotune_gemm = True` to tune only GEMMs, or `torch._inductor.config.max_autotune_pointwise = True` to tune only pointwise kernels.
- **Backend Selection**: Set `torch._inductor.config.max_autotune_gemm_backends` to limit the autotuning backends to `TRITON` for better performance.
- **Freezing for Inference**: Set `torch._inductor.config.freezing = True` to enable constant-folding optimizations.
- **Debugging**: Set `TORCH_COMPILE_DEBUG=1` to extract the Triton kernels generated by Inductor.

```bash
# GEMM tuning
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 your_script.sh
# Restrict the GEMM tuning backend to TRITON
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON your_script.sh
# Inference with freezing, which gives a large improvement on AMD GPUs
TORCHINDUCTOR_FREEZING=1 your_script.sh
```
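The same knobs can also be set from Python instead of environment variables. A minimal sketch, where the toy `Linear` model and the tensor shapes are illustrative:

```python
import torch
import torch._inductor.config as inductor_config

# Programmatic equivalents of the environment variables above.
inductor_config.max_autotune = True                    # TORCHINDUCTOR_MAX_AUTOTUNE=1
inductor_config.coordinate_descent_tuning = True       # TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1
inductor_config.max_autotune_gemm_backends = "TRITON"  # TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON
inductor_config.freezing = True                        # TORCHINDUCTOR_FREEZING=1

# max-autotune benchmarks candidate Triton configs for each GEMM shape it sees.
model = torch.nn.Linear(4096, 4096).half().cuda().eval()
compiled = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    out = compiled(torch.randn(32, 4096, device="cuda", dtype=torch.float16))
```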
To maximize MoE kernel efficiency, use the scripts below to find the best launch configuration.
```bash
# Tuning
# Example: suppose the workload is run as
#   python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy \
#       --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 \
#       --input 1024 --output 8 --attention-backend triton \
#       --sampling-backend pytorch --quantization fp8
# i.e. batch size 32, input length 1024, and output length 8. From the MoE
# kernel's point of view, the prefill batch is 32 * 1024 = 32768 tokens, while
# the decode batch is 32 * 1 = 32 tokens (one output token generated per step).

# Tune the decode MoE:
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32"
# Tune the prefill MoE:
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32768"
```
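The values passed to `--batch` follow directly from the serving parameters; a quick sketch of the arithmetic:

```python
# How the --batch values above are derived from the bench_latency arguments.
batch_size, input_len, output_len = 32, 1024, 8

# Prefill processes every prompt token of every sequence in one pass.
prefill_tokens = batch_size * input_len  # 32 * 1024 = 32768

# Decode generates one token per sequence per step, regardless of output_len.
decode_tokens = batch_size * 1           # 32

print(f'decode: --batch "{decode_tokens}", prefill: --batch "{prefill_tokens}"')
```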
For more detailed information on tuning SGLang performance with AMD GPUs, please refer to the following link: