# Adaptive Speculative Decoding
Adaptive speculative decoding lets SGLang adjust `speculative_num_steps`/`speculative_num_draft_tokens` at runtime instead of keeping a single fixed value for the whole server lifetime.
It is designed for workloads whose accept length changes over time, where one static step count is rarely optimal.
Adaptive mode builds on EAGLE speculative decoding (`--speculative-algorithm EAGLE` with `--speculative-eagle-topk 1`). `speculative_num_steps` controls how many draft-model autoregressive steps run in each speculative round. In practice, the best value depends on the current workload:
- If `num_steps` is too small, the draft model could have produced more accepted tokens, but the round stops too early.
- If `num_steps` is too large, the draft model produces many candidate tokens that the target model rejects, so extra draft work is wasted.

Real traffic often moves between high-acceptance and low-acceptance phases, so one fixed step count is usually a compromise. Adaptive mode tries to follow the workload instead of hard-coding a single global `num_steps`.
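The trade-off can be sketched numerically. This is a toy model (not SGLang code): if we assume each drafted token is accepted independently with probability `p`, the expected number of accepted tokens in a round of `num_steps` draft steps is roughly the sum of `p**i`, and the wasted draft work is the remainder:

```python
# Toy model of the num_steps trade-off (assumption: i.i.d. per-token
# acceptance probability p; real acceptance is not independent).
def expected_accepted(p: float, num_steps: int) -> float:
    """Expected accepted draft tokens in one speculative round."""
    return sum(p**i for i in range(1, num_steps + 1))

for p in (0.4, 0.9):
    for n in (1, 3, 7):
        acc = expected_accepted(p, n)
        wasted = n - acc  # draft steps whose tokens the target model rejects
        print(f"p={p} num_steps={n} accepted~{acc:.2f} wasted~{wasted:.2f}")
```

With low acceptance (`p=0.4`) most of the extra work at `num_steps=7` is wasted, while with high acceptance (`p=0.9`) larger step counts keep paying off, which is exactly why one static value is a compromise.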
The adaptive mechanism has three pieces:
- `AdaptiveSpeculativeParams`: the EMA-based policy
- `SpecRuntimeState`: the per-tier runtime state bundle
- `AdaptiveController`: the coordinator that chooses a tier and activates the matching runtime state

At startup, SGLang pre-builds one runtime state per candidate tier. By default, the candidate tiers are `candidate_steps = [1, 3, 7]`.
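The pre-built pool can be pictured with a minimal sketch. The class and field names mirror the docs, but the structure (and the `draft_tokens = steps + 1` relationship, taken from the launch example below) is an illustrative assumption, not the actual SGLang implementation:

```python
# Hypothetical sketch of the per-tier state pool; not actual SGLang code.
from dataclasses import dataclass

@dataclass
class SpecRuntimeState:
    speculative_num_steps: int
    speculative_num_draft_tokens: int
    # In SGLang each state also bundles per-stage attention backends and
    # captured CUDA graphs (draft / verify / extend); elided here.

candidate_steps = [1, 3, 7]
state_pool = {
    # Assumption: draft_tokens = steps + 1, as in the launch example (3 -> 4).
    steps: SpecRuntimeState(steps, steps + 1)
    for steps in candidate_steps
}

# Activating a tier is just a reference swap into the pre-built pool.
active_state = state_pool[3]
```

Because every tier's state exists up front, switching tiers never allocates or recaptures anything on the hot path.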
```
┌──────────────────────────────────────────────────────────┐
│ SpecRuntimeState                                         │
│                                                          │
│ speculative_num_steps / speculative_num_draft_tokens     │
│                                                          │
│ ┌────────────────┐ ┌────────────────┐ ┌──────────────┐   │
│ │ Draft stage    │ │ Verify stage   │ │ Extend stage │   │
│ │                │ │                │ │              │   │
│ │ attn_backend   │ │ attn_backend   │ │ attn_backend │   │
│ │ cuda_graph     │ │ cuda_graph     │ │ cuda_graph   │   │
│ └────────────────┘ └────────────────┘ └──────────────┘   │
└──────────────────────────────────────────────────────────┘
```
This matters because `CudaGraphRunner` is shape-dependent. Each candidate tier owns its own graph and backend state, so runtime switching is a reference swap, not an online graph recapture.
The adaptive update happens after verify and affects the next round, not the current one:
```
┌─────────────────────────────────────────────────────────────────────┐
│ EAGLEWorker.forward_batch_generation() — decode path                │
│                                                                     │
│ ① draft(batch)                                                      │
│ │ draft model multi-step generation with current tier               │
│ v                                                                   │
│ ② verify(batch, spec_info)                                          │
│ │ target model tree verification                                    │
│ │ → produces accept_length_per_req                                  │
│ v                                                                   │
│ ③ forward_draft_extend_after_decode(batch)                          │
│ │ draft model KV-cache catch-up                                     │
│ v                                                                   │
│ ④ adaptive_controller.on_verify_complete(accept_lengths)            │
│ │                                                                   │
│ │ update EMA, apply warmup / interval / hysteresis gates            │
│ │ if tier changed, select a pre-built state from pool               │
│ v                                                                   │
│ worker.apply_runtime_state(state)                                   │
│                                                                     │
│ Tier switch happens after the current round completes.              │
│ Backends and CUDA graphs are never swapped mid-round.               │
└─────────────────────────────────────────────────────────────────────┘
```
After each verify pass, SGLang reads the accepted draft length per request, computes the batch average, smooths it with an exponential moving average (EMA), and switches among the pre-built candidate tiers (`[1, 3, 7]` by default).
The decision logic is intentionally conservative:
- `warmup_batches` skips the first few batches
- `update_interval` avoids switching every batch
- `down_hysteresis` and `up_hysteresis` reduce oscillation

Conceptually, the policy probes one step beyond the observed acceptance:
```
target_steps ≈ clamp(round(ema_accept_len) + 1, min(candidate_steps), max(candidate_steps))
```
So if recent requests consistently accept more drafted tokens, the policy tends to move up. If they start rejecting earlier, it tends to move down.
`--speculative-adaptive-config` is optional, but the speculative setup still needs to be valid for adaptive mode.
```bash
python3 -m sglang.launch_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
  --speculative-eagle-topk 1 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 4 \
  --speculative-adaptive
```
If you want to override the defaults, add `--speculative-adaptive-config /path/to/adaptive_spec.json`.
Example config:
```json
{
  "candidate_steps": [1, 3, 7],
  "ema_alpha": 0.2,
  "warmup_batches": 10,
  "update_interval": 5
}
```
The config file is optional. Any omitted keys use defaults.
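The merge behavior can be sketched as follows. `load_adaptive_config` is a hypothetical helper (not an SGLang API); the default values are the ones documented in the table:

```python
# Hypothetical loader sketch: overlay a user JSON config on the documented
# defaults, so any omitted key keeps its default value.
import json

DEFAULTS = {
    "candidate_steps": [1, 3, 7],
    "ema_alpha": 0.2,
    "update_interval": 5,
    "warmup_batches": 10,
    "down_hysteresis": -0.25,
    "up_hysteresis": 0.0,
}

def load_adaptive_config(path=None):
    cfg = dict(DEFAULTS)
    if path is not None:
        with open(path) as f:
            cfg.update(json.load(f))
    return cfg
```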
| Key | Default | Meaning |
|---|---|---|
| `candidate_steps` | `[1, 3, 7]` | Discrete `speculative_num_steps` tiers that adaptive mode can switch between |
| `ema_alpha` | `0.2` | EMA smoothing factor for accepted draft length |
| `update_interval` | `5` | Recompute interval, in verify batches, after warmup |
| `warmup_batches` | `10` | Number of verify batches to observe before switching |
| `down_hysteresis` | `-0.25` | Extra margin before moving to a smaller step |
| `up_hysteresis` | `0.0` | Extra margin before moving to a larger step |
The initial `--speculative-num-steps` is snapped to the nearest value in `candidate_steps`.
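The snapping rule is a nearest-neighbor pick over the tier list. A one-line sketch (hypothetical helper name, and tie-breaking toward the smaller tier is an assumption):

```python
# Hypothetical sketch of initial-tier snapping; not actual SGLang code.
def snap_to_candidate(num_steps, candidate_steps=(1, 3, 7)):
    """Return the candidate tier closest to the configured num_steps."""
    return min(candidate_steps, key=lambda s: abs(s - num_steps))
```

For example, launching with `--speculative-num-steps 4` would start on tier 3, and `--speculative-num-steps 6` on tier 7.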
You can inspect the active tier and acceptance metric via /server_info:
```bash
curl -s http://127.0.0.1:30000/server_info | jq '.internal_states[0] | {speculative_num_steps, avg_spec_accept_length}'
```
- `speculative_num_steps` is the current active tier
- `avg_spec_accept_length` helps explain whether the server is likely to move up or down

Tuning tips:

- Adjust `candidate_steps` away from the default `[1, 3, 7]` to match your workload
- Raise `ema_alpha` to react faster, or lower it for more stability
- Increase `warmup_batches` or `update_interval` if tier switching is too noisy