docs/features/speculative_decoding/dynamic_speculative_decoding.md
Speculative decoding (SD) methods must verify K draft tokens for each sequence at every decoding step. As the batch size (BS) increases, the effective batch size during verification becomes BS*K, which raises the compute cost of verification. Once BS*K exceeds a critical batch size, SD slows decoding down, i.e. it increases the time per output token (TPOT) instead of reducing it. Dynamic SD (DSD) addresses this by tuning K to an optimal value for the current batch size, so that SD keeps paying off as load grows.
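The BS*K arithmetic can be made concrete with a short sketch. The threshold `CRITICAL_BS` below is a made-up placeholder for illustration only; the real critical batch size depends on the model and hardware and has to be measured.

```python
def verification_tokens(batch_size: int, num_draft_tokens: int) -> int:
    """Effective batch size during verification, using the BS*K estimate."""
    return batch_size * num_draft_tokens


# Hypothetical threshold, NOT a measured value: beyond this effective
# batch size we assume verification becomes compute-bound.
CRITICAL_BS = 256

for bs in (16, 64, 128):
    effective = verification_tokens(bs, 3)
    verdict = "SD likely helps" if effective <= CRITICAL_BS else "SD likely hurts"
    print(f"BS={bs:>3}, K=3 -> effective BS={effective:>3} ({verdict})")
```

Under this assumed threshold, a fixed K=3 stops paying off somewhere between BS=64 and BS=128, which is the situation a per-batch-size K is meant to handle.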
--speculative-config schema

To use Dynamic SD, add num_speculative_tokens_per_batch_size to the config of an SD method. Its value is a list of lists in which each entry is [start_bs, end_bs, optimal_K]: when the concurrency is within the range [start_bs, end_bs], optimal_K draft tokens are used. For example,
--speculative-config '{
"method": "eagle",
"model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
"num_speculative_tokens": 3,
"num_speculative_tokens_per_batch_size": [
[1, 64, 3],
[65, 128, 1],
[129, 512, 0]
]
}'
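implies that:
- when the concurrency is between 1 and 64, 3 draft tokens are used;
- when the concurrency is between 65 and 128, 1 draft token is used;
- when the concurrency is between 129 and 512, no draft tokens are used.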
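The [start_bs, end_bs, optimal_K] semantics can be sketched as a plain table lookup. This is an illustration of how to read the config, not vLLM's implementation; `lookup_num_draft_tokens` is a hypothetical helper, and returning `None` for a batch size outside every range is an assumption, since that case is not described here.

```python
from typing import List, Optional

# Table from the example config above: [start_bs, end_bs, optimal_K].
SCHEDULE: List[List[int]] = [[1, 64, 3], [65, 128, 1], [129, 512, 0]]


def lookup_num_draft_tokens(
    schedule: List[List[int]], batch_size: int
) -> Optional[int]:
    """Return optimal_K for the entry whose inclusive range contains batch_size."""
    for start_bs, end_bs, optimal_k in schedule:
        if start_bs <= batch_size <= end_bs:
            return optimal_k
    # Assumption: no entry matched, so this sketch reports "unknown".
    return None


for bs in (1, 64, 65, 128, 129, 512):
    print(f"concurrency={bs:>3} -> K={lookup_num_draft_tokens(SCHEDULE, bs)}")
```

Both range bounds are treated as inclusive, matching the way the example entries meet at 64/65 and 128/129 without overlapping.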
VLLM_USE_V2_MODEL_RUNNER=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-config '{
"method": "eagle",
"model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
"num_speculative_tokens": 3,
"num_speculative_tokens_per_batch_size": [
[1, 64, 3],
[65, 128, 1],
[129, 512, 0]
]
}'
VLLM_USE_V2_MODEL_RUNNER=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-config '{
"method": "eagle3",
"model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
"num_speculative_tokens": 3,
"num_speculative_tokens_per_batch_size": [
[1, 16, 5],
[17, 32, 4],
[33, 64, 3],
[65, 128, 1],
[129, 512, 0]
]
}'
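Writing the JSON by hand is error-prone, so a small script can generate the --speculative-config value instead. The sketch below uses only the Python standard library; `build_speculative_config` is a hypothetical helper, and its sanity checks (ascending, non-overlapping ranges and a non-negative K) are choices made for this example rather than documented vLLM requirements.

```python
import json
import shlex
from typing import List


def build_speculative_config(
    method: str,
    model: str,
    num_speculative_tokens: int,
    schedule: List[List[int]],
) -> str:
    """Serialize a Dynamic SD config to JSON after basic sanity checks."""
    prev_end = 0
    for start_bs, end_bs, optimal_k in schedule:
        # Checks below are assumptions of this sketch, not vLLM rules.
        if end_bs < start_bs:
            raise ValueError(f"empty range: [{start_bs}, {end_bs}]")
        if start_bs <= prev_end:
            raise ValueError(f"range starting at {start_bs} overlaps the previous one")
        if optimal_k < 0:
            raise ValueError(f"negative draft token count: {optimal_k}")
        prev_end = end_bs
    config = {
        "method": method,
        "model": model,
        "num_speculative_tokens": num_speculative_tokens,
        "num_speculative_tokens_per_batch_size": schedule,
    }
    return json.dumps(config)


config_json = build_speculative_config(
    "eagle",
    "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
    3,
    [[1, 64, 3], [65, 128, 1], [129, 512, 0]],
)
# shlex.quote wraps the JSON so it can be pasted into a shell command.
print("--speculative-config " + shlex.quote(config_json))
```

The printed argument can be appended to a `vllm serve` command such as the ones above.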
Dynamic SD is not yet enabled on the V2 model runner (MRv2). We are working on enabling it there with full CUDA graph support.