SGLang CI failure monitoring

scripts/ci_monitor/README.md

Scripts used by .github/workflows/ci-failure-monitor.yml for scheduled CI failure analysis.

Tools

  1. Failures Analyzer (ci_failures_analysis.py): Tracks consecutive failures, identifies flaky jobs, and monitors runner health across PR Test / Nightly workflows (Nvidia, AMD, Intel, XPU, NPU).
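To illustrate what "tracks consecutive failures" could mean in practice, here is a minimal sketch. It is not taken from ci_failures_analysis.py: the function names, the shape of the `history` dictionary, and the choice of `>=` for the threshold comparison are all assumptions made for illustration. The only borrowed detail is that GitHub Actions reports job conclusions such as `"success"` and `"failure"`.

```python
from typing import Dict, List


def current_failure_streak(conclusions: List[str]) -> int:
    """Count failures at the most recent end of a job's history.

    `conclusions` is assumed to be ordered oldest to newest.
    """
    streak = 0
    for conclusion in reversed(conclusions):
        if conclusion != "failure":
            break
        streak += 1
    return streak


def jobs_over_threshold(history: Dict[str, List[str]], threshold: int) -> Dict[str, int]:
    """Return jobs whose current failure streak is at least `threshold`."""
    streaks = {job: current_failure_streak(runs) for job, runs in history.items()}
    return {job: n for job, n in streaks.items() if n >= threshold}


# Hypothetical job names and histories, for illustration only.
history = {
    "unit-test-a": ["success", "failure", "failure", "failure"],
    "unit-test-b": ["failure", "success", "failure", "success"],
}
print(jobs_over_threshold(history, threshold=3))  # {'unit-test-a': 3}
```

In this sketch, a job that alternates between passing and failing (like `unit-test-b`) never builds a streak, which is one reason a separate flakiness check is useful alongside a consecutive-failure count.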

Installation

```bash
pip install requests
```

Usage

Failures Analyzer

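```bash
export GITHUB_TOKEN="your_token_here"

# Analyze the last 50 runs; alert after 2 consecutive failures
python ci_failures_analysis.py --token "$GITHUB_TOKEN" --limit 50 --threshold 2
# Analyze the last 300 runs; alert after 2 consecutive failures
python ci_failures_analysis.py --token "$GITHUB_TOKEN" --limit 300 --threshold 2
# Analyze the last 500 runs; alert after 3 consecutive failures (the defaults)
python ci_failures_analysis.py --token "$GITHUB_TOKEN" --limit 500 --threshold 3
```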

Token permissions

The GitHub token needs repo and workflow scopes to read CI run data; otherwise API calls may return 404.
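The following sketch shows one way a script might attach the token to GitHub REST API requests and interpret common error statuses. The helper names `github_headers` and `explain_status` are hypothetical and not part of ci_failures_analysis.py; the header values follow GitHub's documented REST conventions, and the status explanations are general guidance rather than the script's actual messages.

```python
from typing import Dict


def github_headers(token: str) -> Dict[str, str]:
    """Build request headers for the GitHub REST API (hypothetical helper)."""
    if not token:
        raise ValueError("a GitHub token is required")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }


def explain_status(status_code: int) -> str:
    """Map common HTTP statuses to a likely cause (illustrative only)."""
    if status_code == 200:
        return "ok"
    if status_code == 401:
        return "bad credentials: the token is invalid or expired"
    if status_code == 403:
        return "forbidden: insufficient permissions or rate limit reached"
    if status_code == 404:
        return "not found: wrong repository path or token missing required scopes"
    return f"unexpected status {status_code}"


# These headers can be passed to requests.get(url, headers=...).
headers = github_headers("your_token_here")
print(explain_status(404))
```

GitHub tends to answer with 404 rather than 403 when a token cannot see a resource, so a 404 on a repository that clearly exists is usually worth treating as a permissions problem first.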

Failures Analyzer parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| --token | Required | GitHub Personal Access Token |
| --limit | 500 | Number of workflow runs to analyze |
| --threshold | 3 | Alert threshold for consecutive failures |
| --output | None | Output JSON file (optional) |
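For readers who want to see how these parameters fit together, the table can be mirrored with a small `argparse` setup. This is a sketch built only from the defaults listed above; the real parser in ci_failures_analysis.py may use different types, help text, or extra options, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Construct a parser matching the documented parameters (illustrative)."""
    parser = argparse.ArgumentParser(description="Analyze CI failures (sketch)")
    parser.add_argument("--token", required=True, help="GitHub Personal Access Token")
    parser.add_argument("--limit", type=int, default=500,
                        help="Number of workflow runs to analyze")
    parser.add_argument("--threshold", type=int, default=3,
                        help="Alert threshold for consecutive failures")
    parser.add_argument("--output", default=None,
                        help="Output JSON file (optional)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--token", "your_token_here", "--limit", "50"])
print(args.limit, args.threshold, args.output)  # 50 3 None
```

Only `--token` has to be supplied; omitting the other flags falls back to the defaults in the table.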

Historical note

The former CI Monitor workflow (ci-monitor.yml) and its analyzers (ci_analyzer.py, ci_analyzer_perf.py, ci_analyzer_balance.py) were removed as redundant; use this failure monitor workflow and scripts for ongoing CI health alerts.