docs_new/docs/advanced_features/adaptive_speculative_decoding.mdx
Adaptive speculative decoding lets SGLang adjust speculative_num_steps/speculative_num_draft_tokens at runtime instead of keeping a single fixed value for the whole server lifetime.
It is designed for workloads whose accept length changes over time, where one static step count is rarely optimal.
Adaptive mode requires the following speculative settings:

- --speculative-algorithm EAGLE or EAGLE3
- --speculative-eagle-topk 1

speculative_num_steps controls how many draft-model autoregressive steps run in each speculative round. In practice, the best value depends on the current workload.

- If num_steps is too small, the round stops too early even though the draft model could have produced more accepted tokens.
- If num_steps is too large, the draft model produces many candidate tokens that the target model rejects, so the extra draft work is wasted.

Real traffic often moves between high-acceptance and low-acceptance phases, and batch sizes vary continuously. Adaptive mode follows both signals at runtime instead of hard-coding a single global num_steps.
The adaptive mechanism has three pieces:
- AdaptiveSpeculativeParams: the EMA-based policy
- SpecRuntimeState: the per-tier runtime state bundle
- AdaptiveController: the coordinator that queries the policy for the current batch size and activates the matching runtime state

The controller maintains independent EMA trackers for each batch size (BS) range, so observations at small BS don't pollute the large-BS signal. Each BS range can have its own candidate steps, hysteresis thresholds, and ceiling coefficient.
BS ranges are defined by their lower bounds in the config file (e.g., keys "1" and "8" mean BS 1–7 uses one slot and BS 8+ uses another). SpecRuntimeState objects are shared across BS ranges that use the same step count; each state owns the CUDA graphs captured for the reachable padded batch sizes of that step.
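The lower-bound lookup described above can be sketched in a few lines. This is a minimal illustration, not SGLang's implementation: the function name slot_for_batch_size is hypothetical, and the only behavior taken from the text is that each key is the inclusive lower bound of its range.

```python
from bisect import bisect_right


def slot_for_batch_size(lower_bounds, batch_size):
    """Return the lower-bound key of the BS range containing batch_size.

    Hypothetical helper: lower_bounds mirrors the integer keys of the
    config file, e.g. [1, 8] for the keys "1" and "8".
    """
    bounds = sorted(lower_bounds)
    # Index of the last bound that is <= batch_size.
    idx = bisect_right(bounds, batch_size) - 1
    if idx < 0:
        raise ValueError(
            f"batch size {batch_size} is below the smallest bound {bounds[0]}"
        )
    return bounds[idx]


print(slot_for_batch_size([1, 8], 7))  # 1
print(slot_for_batch_size([1, 8], 8))  # 8
```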
---
title: "SpecRuntimeState — speculative_num_steps / speculative_num_draft_tokens"
---
graph LR
subgraph SR[" "]
direction LR
subgraph D["Draft stage"]
direction TB
d1[attn_backend]
d2[cuda_graph]
end
subgraph V["Verify stage"]
direction TB
v1[attn_backend]
v2[cuda_graph]
end
subgraph E["Extend stage"]
direction TB
e1[attn_backend]
e2[cuda_graph]
end
end
This matters because CudaGraphRunner is shape-dependent. Each candidate tier owns its own graph and backend state, so runtime switching is a reference swap, not an online graph recapture.
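The "reference swap" idea can be illustrated with a small pool that builds every tier once and then only rebinds a pointer. This is a sketch under stated assumptions: TierState and TierPool are hypothetical stand-ins for the real state bundle, and num_draft_tokens = num_steps + 1 is assumed from the topk 1 launch example (3 steps, 4 draft tokens).

```python
from dataclasses import dataclass


@dataclass
class TierState:
    """Hypothetical stand-in for a pre-built tier (backends + captured graphs)."""
    num_steps: int
    num_draft_tokens: int


class TierPool:
    def __init__(self, candidate_steps):
        # Build each distinct tier once, up front. In a real server this is
        # where the expensive graph capture would happen.
        # Assumption: with topk 1, draft tokens = steps + 1.
        self.states = {s: TierState(s, s + 1) for s in sorted(set(candidate_steps))}
        self.active = self.states[min(self.states)]

    def activate(self, num_steps):
        """Switch tiers by rebinding a reference; nothing is rebuilt."""
        self.active = self.states[num_steps]
        return self.active


pool = TierPool([1, 3, 7, 3])  # duplicate step counts share one state
first = pool.activate(3)
second = pool.activate(3)
print(first is second)  # True
```

Because activation returns an object that already exists, the cost of a switch does not depend on how expensive the tier was to build.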
The adaptive update happens in two places:
---
title: "EAGLEWorker.forward_batch_generation() — decode path"
---
flowchart TD
Z["⓪ activate_step_by_batch(batch_size)
query optimal step for current BS range, activate if different"]
A["① draft(batch)
draft model multi-step generation with current tier"]
B["② verify(batch, spec_info)
target model tree verification → produces num_correct_drafts_per_req"]
C["③ forward_draft_extend_after_decode(batch)
draft model KV-cache catch-up"]
D["④ adaptive_controller.on_verify_complete(num_correct_drafts_per_req, batch_size)
update EMA for matching BS slot, apply warmup / interval / hysteresis gates
if tier changed, select a pre-built state from pool"]
E["worker.apply_runtime_state(state)"]
Z --> A --> B --> C --> D --> E
Tier switch happens after the current round completes. Backends and CUDA graphs are never swapped mid-round.
After each verify pass, SGLang reads the accepted draft length per request, computes the batch average, smooths it with an exponential moving average (EMA), and switches among the candidate tiers for the matching BS slot.
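The smoothing step can be sketched as follows. This is a minimal illustration, not SGLang's code: the class name EmaTracker, seeding the EMA with the first batch average, and the textbook update rule ema = alpha * x + (1 - alpha) * ema are all assumptions.

```python
class EmaTracker:
    """Hypothetical per-slot tracker for the smoothed accept length."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.value = None  # no observation yet

    def update(self, accept_lens):
        """Fold one verify pass (accepted draft length per request) into the EMA."""
        if not accept_lens:
            return self.value  # empty batch: nothing to learn
        batch_avg = sum(accept_lens) / len(accept_lens)
        if self.value is None:
            # Assumption: seed with the first observation instead of zero.
            self.value = batch_avg
        else:
            self.value = self.alpha * batch_avg + (1 - self.alpha) * self.value
        return self.value


# One independent tracker per BS slot, keyed by the slot's lower bound.
trackers = {1: EmaTracker(alpha=0.5), 8: EmaTracker(alpha=0.5)}
print(trackers[1].update([2, 4]))  # 3.0
print(trackers[1].update([1, 1]))  # 2.0
```

Keeping one tracker per slot is what prevents a burst of small-batch traffic from moving the large-batch estimate.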
The decision logic is intentionally conservative:
- warmup_batches skips the first few batches
- update_interval avoids switching every batch
- down_hysteresis and up_hysteresis reduce oscillation
- ceiling_coeff enables an optional EMA ceiling rule that can cap num_steps proportionally to observed draft quality, preventing over-speculation at high BS

Conceptually, the policy probes one step beyond the observed acceptance:
target_steps ≈ clamp(round(ema_accept_len) + 1, min(candidate_steps), max(candidate_steps))
So if recent requests consistently accept more drafted tokens, the policy tends to move up. If they start rejecting earlier, it tends to move down.
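The conceptual formula translates directly into code. The sketch below is an illustration only: the function name target_steps is hypothetical, and round-half-up is an assumption (Python's built-in round() rounds halves to the nearest even integer, which would behave differently at values such as 2.5). The result is a target, not necessarily a member of candidate_steps; how the real policy maps it onto a tier and applies hysteresis is not shown here.

```python
import math


def target_steps(ema_accept_len, candidate_steps):
    """clamp(round(ema) + 1, min(candidates), max(candidates))."""
    lo, hi = min(candidate_steps), max(candidate_steps)
    # Assumption: round half up, then probe one step beyond what was accepted.
    probe = math.floor(ema_accept_len + 0.5) + 1
    return max(lo, min(probe, hi))


print(target_steps(0.4, [1, 3, 7]))  # 1
print(target_steps(2.6, [1, 3, 7]))  # 4
print(target_steps(9.0, [1, 3, 7]))  # 7
```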
--speculative-adaptive-config is optional, but the speculative setup still needs to be valid for adaptive mode.
python3 -m sglang.launch_server \
--model meta-llama/Llama-2-7b-chat-hf \
--speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
--speculative-eagle-topk 1 \
--speculative-num-steps 3 \
--speculative-num-draft-tokens 4 \
--speculative-adaptive
If you want to override the defaults, add --speculative-adaptive-config /path/to/adaptive_spec.json.
Example config:
{
"ema_alpha": 0.2,
"warmup_batches": 10,
"update_interval": 5,
"1": {"candidate_steps": [1, 3, 7], "up_hysteresis": 0.0, "down_hysteresis": -0.25, "ceiling_coeff": 0},
"8": {"candidate_steps": [1], "up_hysteresis": 0.0, "down_hysteresis": 0.0, "ceiling_coeff": 0}
}
Non-integer keys (ema_alpha, warmup_batches, update_interval) are global overrides applied to every BS slot. Integer keys ("1", "8") define per-BS slots.
The config file is optional. When provided, each integer BS-slot key must specify candidate_steps; all other keys fall back to defaults.
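The key convention above (non-integer keys are global, integer keys are BS slots that must carry candidate_steps) can be checked with a short script before launching the server. This is a sketch under stated assumptions: split_adaptive_config is a hypothetical helper, not part of SGLang, and it validates only the rules stated in this section.

```python
import json

EXAMPLE = """
{
  "ema_alpha": 0.2,
  "warmup_batches": 10,
  "update_interval": 5,
  "1": {"candidate_steps": [1, 3, 7], "up_hysteresis": 0.0, "down_hysteresis": -0.25, "ceiling_coeff": 0},
  "8": {"candidate_steps": [1], "up_hysteresis": 0.0, "down_hysteresis": 0.0, "ceiling_coeff": 0}
}
"""


def split_adaptive_config(raw):
    """Split a parsed config into (global overrides, per-BS slots)."""
    global_overrides, slots = {}, {}
    for key, value in raw.items():
        if key.isdigit():
            # Integer key: a BS slot, which must list its candidate steps.
            if not isinstance(value, dict) or not value.get("candidate_steps"):
                raise ValueError(f"BS slot {key!r} must specify candidate_steps")
            slots[int(key)] = value
        else:
            # Non-integer key: a global override applied to every slot.
            global_overrides[key] = value
    return global_overrides, dict(sorted(slots.items()))


overrides, slots = split_adaptive_config(json.loads(EXAMPLE))
print(sorted(overrides))  # ['ema_alpha', 'update_interval', 'warmup_batches']
print(list(slots))        # [1, 8]
```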
You can inspect the active tier and acceptance metric via /server_info:
curl -s http://127.0.0.1:30000/server_info | jq '.internal_states[0] | {speculative_num_steps, avg_spec_accept_length}'
- speculative_num_steps is the current active tier
- avg_spec_accept_length helps explain whether the server is likely to move up or down

Tuning notes:

- Increase ema_alpha to react faster, or lower it for more stability
- Increase warmup_batches or update_interval if tier switching is too noisy
- Narrow candidate_steps ladders (such as [1, 2] or [1]) often outperform wide ones

The built-in default is conservative: it is safe for all draft models but may under-speculate for strong ones. Save one of the presets below as a JSON file and pass it via --speculative-adaptive-config.
This is the built-in default: BS 8–31 allows [1, 3], and BS≥32 locks to step=1 to avoid wasted compute. Best for models like MiniMax-M2.5, DSV4.
{
"1": {"candidate_steps": [1, 3, 7], "up_hysteresis": 0.0, "down_hysteresis": -0.25, "ceiling_coeff": 0},
"8": {"candidate_steps": [1, 3], "up_hysteresis": 0.0, "down_hysteresis": 0.0, "ceiling_coeff": 0},
"32": {"candidate_steps": [1], "up_hysteresis": 0.0, "down_hysteresis": 0.0, "ceiling_coeff": 0}
}
This preset uses wider ladders with a ceiling rule to cap speculation at high BS. Best for models like GLM-4.7-FP8.
{
"1": {"candidate_steps": [1, 3, 7], "up_hysteresis": 0.0, "down_hysteresis": -0.25, "ceiling_coeff": 0},
"8": {"candidate_steps": [1, 3, 7], "up_hysteresis": 0.0, "down_hysteresis": -0.25, "ceiling_coeff": 3.0},
"64": {"candidate_steps": [1, 3], "up_hysteresis": 0.0, "down_hysteresis": -0.25, "ceiling_coeff": 1.67},
"128": {"candidate_steps": [1, 3], "up_hysteresis": 0.0, "down_hysteresis": -0.25, "ceiling_coeff": 1.2}
}
For the best performance, benchmark your specific model across batch sizes with different static num_steps values, then build a per-BS config that matches each range's optimal step. A well-tuned per-model config might outperform the generic presets above.
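One way to turn such a benchmark sweep into a config is sketched below. Everything here is an assumption for illustration: build_per_bs_config is a hypothetical helper, the throughput numbers are made-up placeholders rather than measurements, and collecting the best static step of each measured batch size into candidate_steps is just one reasonable heuristic.

```python
def build_per_bs_config(throughput, lower_bounds):
    """Derive per-BS candidate_steps from static-step benchmark results.

    throughput maps (batch_size, num_steps) -> measured tokens/s from your
    own runs; lower_bounds are the BS range lower bounds you want as keys.
    """
    bounds = sorted(lower_bounds)
    config = {}
    for i, lo in enumerate(bounds):
        hi = bounds[i + 1] if i + 1 < len(bounds) else float("inf")
        best_steps = set()
        # For every measured batch size in [lo, hi), keep its best static step.
        for bs in sorted({b for b, _ in throughput if lo <= b < hi}):
            per_step = {s: t for (b, s), t in throughput.items() if b == bs}
            best_steps.add(max(per_step, key=per_step.get))
        if best_steps:
            config[str(lo)] = {"candidate_steps": sorted(best_steps)}
    return config


# Placeholder numbers only; replace with your own measurements.
sweep = {
    (1, 1): 10.0, (1, 3): 20.0,
    (4, 1): 15.0, (4, 3): 12.0,
    (16, 1): 30.0, (16, 3): 25.0,
}
print(build_per_bs_config(sweep, [1, 8]))
# {'1': {'candidate_steps': [1, 3]}, '8': {'candidate_steps': [1]}}
```

The resulting dictionary contains only candidate_steps, so every other key falls back to its default when the file is loaded.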