docs_new/docs/references/nightly_precision_regression.mdx
The nightly precision regression framework detects silent numerical regressions in the SGLang serving engine by comparing per-layer hidden states between consecutive runs. It runs as a nightly CI job on 8×H200 GPUs and can also be invoked locally for development and debugging.
The framework operates on a rolling-baseline model:
Baselines are stored locally on disk and synced to a HuggingFace dataset so they survive across CI runners and can be shared across machines. The HF dataset store is required — the test errors if SGLANG_PRECISION_HF_REPO is unset.
┌──────────────────────────────────────────────────────────────┐
│ 1. Resolve model config (layer count, capture layers) │
│ ↓ │
│ 2. Compute capture_signature (schema, layers, TP, filter) │
│ ↓ │
│ 3. Fetch baseline from HF dataset (signature-matched) │
│ ↓ │
│ 4. Launch SGLang server with DUMPER enabled │
│ ↓ │
│ 5. POST /dumper/configure (set layer filter + cleanup) │
│ ↓ │
│ 6. POST /v1/chat/completions (fixed prompt, 2 tokens, │
│ ignore_eos=true to force decode path) │
│ ↓ │
│ 7. Kill server; assert decode tensors were captured │
│ ↓ │
│ 8. Baseline exists (with matching signature)? │
│ ├── YES → Run comparator → pass/fail │
│ │ ├── PASS → update baseline, push to HF │
│ │ └── FAIL → push diagnostics to HF │
│ └── NO → copy today's tensors as initial baseline │
│ → push to HF as "baseline_established" │
│ ↓ │
│ 9. Report summary (stdout + GitHub Step Summary) │
└──────────────────────────────────────────────────────────────┘
| Component | File | Purpose |
|---|---|---|
| Test entry point | test/registered/debug_utils/test_nightly_precision_regression.py | Orchestrates server launch, dump, compare, and reporting |
| HF baseline store | python/sglang/test/precision_baseline_store.py | Push / fetch / prune baselines on a HuggingFace dataset |
| Tensor comparator | python/sglang/srt/debug_utils/comparator/ | Compares two directories of .pt tensors, emits JSONL report |
| Dumper infrastructure | python/sglang/srt/debug_utils/dumper.py | Captures per-layer hidden states at runtime |
| CI workflow | .github/workflows/nightly-test-nvidia.yml | Schedules the nightly job on 8×H200 |
Not every layer is dumped — the framework uses a strided capture to reduce I/O and storage overhead. By default, it captures:
LAYER_CAPTURE_STRIDE)The layer count is resolved automatically from the model's HuggingFace config.json (num_hidden_layers or num_layers). If resolution fails, all layers are captured as a safe fallback.
The dumper filter is built dynamically as a regex matching only the selected layer indices, e.g.:
match(r'^non_intrusive__model\.layers\.(0|7|15|23)\.inputs\.1$', name)
The test generates 2 tokens with ignore_eos=True to ensure the model's decode path is exercised. After the dump, _assert_decode_captured() verifies that tensors from the decode step were actually captured (not just prefill). If only prefill tensors are found, the test fails immediately — this catches misconfigurations where --max-total-tokens is too low for the decode loop to run.
The comparator computes relative differences (rel_diff) for each tensor and checks them against a configurable threshold (default 1e-3). For tensor-parallel models, the --override-dims flag tells the comparator how to reduce across TP ranks before comparing:
--override-dims ^non_intrusive__model\.layers\.\d+\.inputs\.1$:bs h[tp:partial]
This sums partial TP contributions along the hidden dimension before computing the diff, so the comparison is semantically correct even with TP > 1.
If the comparator returns exit code 0 but compared zero layers (baseline/target name mismatch), the test fails with a diagnostic message rather than silently passing.
A capture_signature (SHA-1 hash of schema version, max_tokens, ignore_eos, TP size, and dumper filter) is computed per run. The HF store uses this signature during fetch to ensure only baselines with an identical capture shape are considered. If the signature changes (e.g. you add layers to the capture set or change TP), the framework establishes a fresh baseline instead of erroring on incompatible tensors.
| Variable | Default | Description |
|---|---|---|
SGLANG_PRECISION_MODELS | zai-org/GLM-5.1-FP8 | Comma-separated HuggingFace model IDs to test |
SGLANG_PRECISION_BASELINE_DIR | /tmp/sglang_precision_baselines | Local directory for baseline tensors |
SGLANG_PRECISION_DIFF_THRESHOLD | 1e-3 | Per-tensor relative diff threshold |
SGLANG_PRECISION_FORCE_UPDATE | 0 | Set to 1 to skip comparison and unconditionally refresh baseline |
SGLANG_PRECISION_COMMIT | (auto-detected from git) | Override the sglang commit SHA tagged on push |
SGLANG_PRECISION_HF_REPO | (required) | HuggingFace dataset repo for cross-runner baseline storage |
SGLANG_PRECISION_HF_REVISION | main | Branch/revision of the HF dataset |
HF_TOKEN | (required in CI) | HuggingFace token with write access to the dataset |
The nightly job nightly-test-precision-8-gpu-h200 is defined in .github/workflows/nightly-test-nvidia.yml and runs on an 8-GPU H200 runner. It is included in the nightly suite via test/run_suite.py.
Key CI configuration:
- name: Run precision regression test
timeout-minutes: 120
env:
SGLANG_PRECISION_BASELINE_DIR: /tmp/sglang_precision_baselines
SGLANG_PRECISION_HF_REPO: ${{ vars.SGLANG_PRECISION_HF_REPO }}
SGLANG_PRECISION_HF_REVISION: ${{ vars.SGLANG_PRECISION_HF_REVISION || 'main' }}
HF_TOKEN: ${{ secrets.HF_TOKEN_PRECISION_STORE }}
SGLANG_PRECISION_COMMIT: ${{ github.sha }}
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-precision-8-gpu-h200 --nightly --continue-on-error --timeout-per-file 3600
| Name | Type | Purpose |
|---|---|---|
SGLANG_PRECISION_HF_REPO | Repository variable | HF dataset repo ID (e.g. org/sglang-precision-baselines) — required, the test errors if unset |
SGLANG_PRECISION_HF_REVISION | Repository variable (optional) | Dataset branch (defaults to main) |
HF_TOKEN_PRECISION_STORE | Repository secret | HF token with write access to the dataset |
When running in CI, the test writes a Markdown table to the GitHub Actions job summary showing each model's status (PASSED, FAILED, BASELINE_ESTABLISHED, or ERROR).
Baselines are organized in the HF dataset as:
<model_sanitized>/<YYYY>/<MM>/<DD>/run-<sha7>/
├── meta.json # Run metadata (model, commit, hardware, thresholds, stats)
├── comparator_report.jsonl # Per-tensor comparison results
└── tensors/
├── layer_0_inputs_1.pt
├── layer_7_inputs_1.pt
└── ...
A top-level manifest.jsonl tracks all runs with one JSON object per line. Each manifest row carries a capture_signature field so that fetch selects only baselines with a matching capture shape.
The prune_old_runs() function (callable manually) retains daily runs for 30 days and keeps one run per week beyond that window.
Edit the default in test/registered/debug_utils/test_nightly_precision_regression.py:
DEFAULT_MODELS_FOR_NIGHTLY_PRECISION = "zai-org/GLM-5.1-FP8,your-org/your-model"
Or set the SGLANG_PRECISION_MODELS environment variable in the CI workflow to override the default.
export SGLANG_PRECISION_MODELS="your-org/your-model"
export SGLANG_PRECISION_BASELINE_DIR="/tmp/my_precision_baselines"
export SGLANG_PRECISION_DIFF_THRESHOLD="1e-3"
export SGLANG_PRECISION_HF_REPO="your-org/sglang-precision-baselines"
export HF_TOKEN="hf_..."
cd test
python3 -m pytest registered/debug_utils/test_nightly_precision_regression.py -v
Verify the model works with the dumper. Run locally first to ensure hidden states are captured correctly:
export SGLANG_PRECISION_MODELS="your-org/your-model"
export SGLANG_PRECISION_BASELINE_DIR="/tmp/test_baselines"
export SGLANG_PRECISION_HF_REPO="your-org/sglang-precision-baselines"
export HF_TOKEN="hf_..."
export SGLANG_PRECISION_FORCE_UPDATE="1" # first run: establish baseline
cd test
python3 -m pytest registered/debug_utils/test_nightly_precision_regression.py -v -k test_precision
Run a comparison pass (remove FORCE_UPDATE):
unset SGLANG_PRECISION_FORCE_UPDATE
python3 -m pytest registered/debug_utils/test_nightly_precision_regression.py -v -k test_precision
This should report PASSED if the engine is numerically stable for the model.
Set the tensor-parallelism size. If the model requires TP > 1, the test harness defaults to tp_size=8 for all models. To customize, modify the ModelLaunchSettings construction in the test or pass extra server arguments:
# In setUpClass or via env-driven logic
cls.models = [ModelLaunchSettings("your-org/your-model", tp_size=4)]
Adjust the diff threshold if needed. FP8 or quantized models may exhibit larger numerical differences. Set SGLANG_PRECISION_DIFF_THRESHOLD to an appropriate value (e.g., 1e-2 for FP8).
Add to the default model list or configure SGLANG_PRECISION_MODELS in the CI workflow.
| Concern | How to handle |
|---|---|
| TP size != 8 | Override tp_size in ModelLaunchSettings or add model-specific logic |
| Quantized models (FP8, GPTQ) | Loosen SGLANG_PRECISION_DIFF_THRESHOLD (e.g., 1e-2) |
| Model needs extra server args | Pass them via ModelLaunchSettings(model, extra_args=["--quantization", "fp8"]) |
| Model needs different prompt | Modify PROMPT constant or make it model-configurable |
| MoE models with TP partial sums | Already handled by --override-dims (bs h[tp:partial]) |
| Fewer/more capture layers | Adjust LAYER_CAPTURE_STRIDE (default 8); set lower for smaller models |
| Decode not captured | Ensure --max-total-tokens is well above the scheduler's decode reservation (default 512); the test uses 4096 |
huggingface_hub installedHF_TOKEN. The HF store is mandatory — SGLANG_PRECISION_HF_REPO must be set or the test will error at startup. This is because the nightly CI runners are ephemeral (no persistent local disk), so baselines must survive across runs via the HF dataset. There is currently no local-only fallback.# All three are required — the test errors if SGLANG_PRECISION_HF_REPO is unset.
export SGLANG_PRECISION_MODELS="Qwen/Qwen2.5-0.5B-Instruct"
export SGLANG_PRECISION_BASELINE_DIR="/tmp/precision_baselines"
export SGLANG_PRECISION_HF_REPO="your-org/sglang-precision-baselines"
export HF_TOKEN="hf_..."
# First run: establish baseline
cd test
python3 -m pytest registered/debug_utils/test_nightly_precision_regression.py -v
# Second run: compare against baseline
python3 -m pytest registered/debug_utils/test_nightly_precision_regression.py -v
export SGLANG_PRECISION_FORCE_UPDATE="1"
python3 -m pytest registered/debug_utils/test_nightly_precision_regression.py -v
| Status | Meaning |
|---|---|
BASELINE_ESTABLISHED | No prior baseline with a matching signature existed; today's tensors saved as the new baseline |
PASSED | All per-layer hidden states are within the diff threshold; baseline updated |
FAILED | One or more layers exceeded the diff threshold, or 0 layers were compared (baseline/target mismatch); diagnostic data pushed to HF |
ERROR | Server launch, inference, or comparison encountered an unexpected error |
============================================================
Nightly Precision Regression Summary
============================================================
Model Status Details
------------------------------------------------------------
zai-org/GLM-5.1-FP8 PASSED comparison ok, baseline updated
Qwen/Qwen2.5-0.5B-Instruct FAILED tensor=layer_23.inputs_1 rel_diff=0.0152
============================================================
/tmp/nightly_precision_<model>_*.logpass_label="failed" for offline diagnosisBaselines are stored at:
$SGLANG_PRECISION_BASELINE_DIR/<model_sanitized>/nightly_precision/*.pt
A baseline_meta.json next to the tensors records the timestamp and commit that produced the baseline.
prune_old_runs() to garbage-collect old baselines (keeps 30 days of daily runs, one per week after that).If an intentional numerical change (e.g., kernel optimization, model refactor) causes a comparison failure:
SGLANG_PRECISION_FORCE_UPDATE=1 and run the test once to establish a new baselineIf you change the capture configuration (stride, TP size, etc.), the capture_signature will differ and the framework automatically establishes a fresh baseline — no manual intervention needed.
The framework uses a rolling baseline: every successful comparison updates the baseline to the current run's tensors. This means the reference shifts forward each day. While individual day-to-day diffs stay within the configured threshold, tiny numerical differences can accumulate over time, causing the baseline to silently drift away from the original golden values.
Implications:
Mitigation strategies (not yet implemented):
The test requires a HuggingFace dataset (SGLANG_PRECISION_HF_REPO) and a write-capable HF_TOKEN. There is no local-only fallback. This is by design — CI runners have no persistent local disk, so the HF dataset is the only way to carry baselines across runs. If you need to run the test locally, you must set up a HF dataset (even a private one) and provide the corresponding token.
| File | Role |
|---|---|
test/registered/debug_utils/test_nightly_precision_regression.py | Main test — server lifecycle, dump, compare, report |
python/sglang/test/precision_baseline_store.py | HF dataset store — push, fetch, prune baselines |
python/sglang/srt/debug_utils/comparator/ | Tensor comparison engine |
python/sglang/srt/debug_utils/dumper.py | Runtime hidden-state capture |
.github/workflows/nightly-test-nvidia.yml | CI workflow definition |
test/run_suite.py | Test suite registration (includes nightly-precision-8-gpu-h200) |