During inference serving, it is sometimes necessary to monitor the internal execution flow of the serving framework to identify performance issues. By collecting start/end timestamps of key flows, identifying critical functions or iterations, recording key events, and gathering relevant information, you can quickly locate performance bottlenecks.
This guide walks you through the complete workflow of collecting performance data in an SGLang Ascend NPU inference service — from preparation, collection, and analysis to visualization — helping you get started with performance profiling quickly.
For more profiling scenarios (e.g., Nsight Systems or PD disaggregation), see SGLang Benchmark and Profiling.
SGLang has built-in PyTorch Profiler support. Through the Ascend torch_npu
backend, you can directly collect NPU operator-level performance data. No
additional packages are required — profiling start/stop is controlled via API
requests.
Set the SGLANG_TORCH_PROFILER_DIR environment variable to control where
performance files are saved, then launch an SGLang online service. Once the
service is running, the profiler stays idle until it receives a start request.
# Set the performance data output directory
export SGLANG_TORCH_PROFILER_DIR=./sglang_profile
# Start SGLang server (use local model path or HuggingFace model id)
sglang serve \
--model-path /path/to/your/model \
--attention-backend ascend \
--host 0.0.0.0 --port 30000 \
--tp-size 1 \
--max-running-requests 128
Profiling-related environment variables:
<table> <thead> <tr> <th>Variable</th> <th>Description</th> <th>Default</th> </tr> </thead> <tbody> <tr> <td><code>SGLANG_TORCH_PROFILER_DIR</code></td> <td>Trace file output directory</td> <td><code>/tmp</code></td> </tr> <tr> <td><code>SGLANG_PROFILE_WITH_STACK</code></td> <td>Record Python call stack (True / False)</td> <td><code>True</code></td> </tr> <tr> <td><code>SGLANG_PROFILE_RECORD_SHAPES</code></td> <td>Record operator input shapes (True / False)</td> <td><code>True</code></td> </tr> </tbody> </table>
SGLang provides four collection methods. The core difference is whether you
need to send /start_profile and /stop_profile manually. All four methods
produce identical results, so choose the most convenient one.
Method comparison:
<table> <thead> <tr> <th>Method</th> <th>Manual start_profile</th> <th>Manual stop_profile</th> <th>Notes</th> </tr> </thead> <tbody> <tr> <td>A: API manual start/stop</td> <td>Yes</td> <td>Yes</td> <td>Maximum flexibility for precise control</td> </tr> <tr> <td>B: API auto-stop</td> <td>Yes</td> <td>No</td> <td>Set <code>num_steps</code>, auto-stops and generates output</td> </tr> <tr> <td>C: bench_serving --profile</td> <td>No</td> <td>No</td> <td>Benchmark + profiling in one command</td> </tr> <tr> <td>D: sglang.profiler CLI</td> <td>No</td> <td>No</td> <td>Standalone profiling CLI tool</td> </tr> </tbody> </table>
Method A (API manual start/stop): send /start_profile to start profiling, send
workload requests, then send /stop_profile to stop. After stopping, the server
parses the data automatically, so there is no need to call analyse() manually.
# Step 1: Start profiling (no num_steps, requires manual stop)
curl -X POST http://127.0.0.1:30000/start_profile \
-H "Content-Type: application/json" \
-d '{
"output_dir": "./sglang_profile",
"start_step": 1,
"activities": ["CPU", "GPU"]
}'
# Step 2: Send workload requests (using curl as example)
curl http://127.0.0.1:30000/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello", "sampling_params": {"max_new_tokens": 10}}'
# Step 3: Stop profiling
curl -X POST http://127.0.0.1:30000/stop_profile
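The same start, workload, stop sequence can be scripted. The following is a minimal Python sketch, not part of SGLang: the helper names `post_json`, `profiling`, and `demo` are hypothetical, and the endpoints and payload fields are copied from the curl commands above. Wrapping the sequence in a context manager means /stop_profile is still sent if the workload raises.

```python
import json
import urllib.request
from contextlib import contextmanager


def post_json(url, payload=None, timeout=600):
    """POST an optional JSON body and return the response text."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


@contextmanager
def profiling(base_url, output_dir, start_step=1, post=post_json):
    """Send /start_profile on entry and /stop_profile on exit (hypothetical helper)."""
    post(
        f"{base_url}/start_profile",
        {
            "output_dir": output_dir,
            "start_step": start_step,
            "activities": ["CPU", "GPU"],
        },
    )
    try:
        yield
    finally:
        # Always stop, even if the workload fails, so the server can parse the data.
        post(f"{base_url}/stop_profile")


def demo(base_url="http://127.0.0.1:30000"):
    """Example usage; requires a running server, so it is not called here."""
    with profiling(base_url, "./sglang_profile"):
        post_json(
            f"{base_url}/generate",
            {"text": "Hello", "sampling_params": {"max_new_tokens": 10}},
        )
```

The `post` argument exists only so the request function can be swapped out, for example to dry-run the sequence without a server.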
Method B (API auto-stop): specify num_steps in the /start_profile request.
Profiling stops automatically after that many steps and generates output, so
there is no need to send /stop_profile.
# num_steps=10, wait 3 warmup steps, auto-stop after 10 steps
curl -X POST http://127.0.0.1:30000/start_profile \
-H "Content-Type: application/json" \
-d '{
"output_dir": "./sglang_profile",
"start_step": 3,
"num_steps": 10,
"activities": ["CPU", "GPU"]
}'
# Just send workload — no /stop_profile needed
curl http://127.0.0.1:30000/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello", "sampling_params": {"max_new_tokens": 32}}'
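The only difference between the manual and auto-stop requests is whether num_steps is present in the body. The sketch below makes that explicit; the helper names `start_profile_payload` and `needs_manual_stop` are hypothetical, and the field names come from the curl examples above.

```python
def start_profile_payload(output_dir, start_step=None, num_steps=None,
                          activities=("CPU", "GPU")):
    """Build a /start_profile request body, omitting fields that are not set."""
    payload = {"output_dir": output_dir, "activities": list(activities)}
    if start_step is not None:
        payload["start_step"] = start_step
    if num_steps is not None:
        payload["num_steps"] = num_steps
    return payload


def needs_manual_stop(payload):
    """Without num_steps, profiling runs until /stop_profile is sent."""
    return "num_steps" not in payload


# Auto-stop request matching the curl example above.
auto = start_profile_payload("./sglang_profile", start_step=3, num_steps=10)
# Manual request: the caller must send /stop_profile afterwards.
manual = start_profile_payload("./sglang_profile", start_step=1)
```

Serialize either dictionary with `json.dumps` and send it as the POST body to /start_profile.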
Method C (bench_serving --profile): use SGLang's built-in bench_serving with
the --profile flag. It handles /start_profile and /stop_profile automatically,
so no manual API calls are needed.
# With --profile-steps: auto-stops after N steps and generates output
python -m sglang.bench_serving \
--backend sglang \
--base-url http://127.0.0.1:30000 \
--model /path/to/your/model \
--tokenizer /path/to/your/model \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 100 \
--num-prompts 10 \
--profile \
--profile-steps 10 \
--profile-output-dir ./sglang_profile
# Without --profile-steps: /stop_profile sent automatically after benchmark
python -m sglang.bench_serving \
--backend sglang \
--base-url http://127.0.0.1:30000 \
--model /path/to/your/model \
--tokenizer /path/to/your/model \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 100 \
--num-prompts 10 \
--profile \
--profile-output-dir ./sglang_profile
bench_serving --profile parameters:
Method D (sglang.profiler CLI): use the sglang.profiler CLI module, which
automatically sends /start_profile and waits for completion. Start
sglang.profiler first, then send inference requests; without requests there are
no steps to capture and the profiler will wait indefinitely.
# Terminal 1: Start sglang.profiler first (sends /start_profile, then waits for completion)
python3 -m sglang.profiler \
--url http://127.0.0.1:30000 \
--output-dir ./my_profiles \
--num-steps 3 \
--cpu --gpu &
# Terminal 2: Immediately send inference requests to provide steps for profiling
curl http://127.0.0.1:30000/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello", "sampling_params": {"max_new_tokens": 32}}'
A simpler and more reliable approach is to use bench_serving --profile, which
handles both steps automatically:
python3 -m sglang.bench_serving \
--backend sglang \
--base-url http://127.0.0.1:30000 \
--model /path/to/your/model \
--tokenizer /path/to/your/model \
--dataset-name random \
--random-input-len 128 \
--random-output-len 32 \
--num-prompts 10 \
--profile \
--profile-steps 3 \
--profile-output-dir ./my_profiles
sglang.profiler CLI parameters:
All methods ultimately send a /start_profile request to the server. The full
set of supported parameters:
The server log explicitly indicates where traces are saved. You can find them via:
Profiling starts. Traces will be saved to: <path> (with profile id: <id>)
[2026-05-19 13:23:15] Profiling starts. Traces will be saved to: /tmp/1779196995.6948605 (with profile id: 1779196995.6979997)
[2026-05-19 13:23:15] [WARNING] [350443] profiler.py: Invalid parameter export_type: None, reset it to text.
[2026-05-19 13:23:15] [WARNING] [350443] profiler.py: Invalid parameter export_type: None, reset it to text.
[2026-05-19 13:23:15] INFO: 127.0.0.1:40714 - "POST /start_profile HTTP/1.1" 200 OK
Profiling done. Traces are saved to: <path>
[2026-05-19 13:23:17] Stop profiling...
[2026-05-19 13:23:17] [WARNING] [350443] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
[rank0]:[W519 13:23:17.084812760 compiler_depend.ts:3136] Warning: The indexFromRank 0is not equal indexFromCurDevice 4 , which might be normal if the number of devices on your collective communication server is inconsistent.Otherwise, you need to check if the current device is correct when calling the interface.If it's incorrect, it might have introduced an error. (function operator())
[2026-05-19 13:23:17] [INFO] [352725] profiler.py: Start parsing profiling data: /tmp/1779196995.6948605/localhost.localdomain_350443_20260519132315700_ascend_pt
[2026-05-19 13:23:22] [INFO] [352734] profiler.py: CANN profiling data parsed in a total time of 0:00:04.022310
[2026-05-19 13:23:32] [INFO] [352725] profiler.py: All profiling data parsed in a total time of 0:00:14.305669
[2026-05-19 13:23:32] Profiling done. Traces are saved to: /tmp/1779196995.6948605
sglang.profiler outputs Dump profiling traces to <path>
Dump profiling traces to /tmp/1779243331.3219
Waiting for 10 steps and the trace to be flushed.... (profile_by_stage=False)
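When the logs are long, the trace path can be pulled out with a small script. This is a sketch that assumes the three message formats shown above stay stable; the function name `extract_trace_paths` is hypothetical.

```python
import re

# Matches the three messages shown above; the sglang.profiler line has no colon.
_TRACE_LINE = re.compile(
    r"(?:Traces will be saved to|Traces are saved to|Dump profiling traces to):?\s+(\S+)"
)


def extract_trace_paths(log_text):
    """Return the distinct trace paths mentioned in the log, in order of appearance."""
    paths = []
    for match in _TRACE_LINE.finditer(log_text):
        path = match.group(1)
        if path not in paths:
            paths.append(path)
    return paths
```

Feed it the captured server log or the sglang.profiler output as a single string.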
The directory structure is
<output_dir>/<hostname>_<pid>_<timestamp>_ascend_pt/. When using Method C
(bench_serving --profile), a timestamp subdirectory is added:
<output_dir>/<timestamp>/. Always check the server log for the exact path:
Profiling done. Traces are saved to: <path>.
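Because the nesting depth differs between methods, a recursive search is a convenient way to find the results. The sketch below assumes only the `*_ascend_pt` naming described above and the ASCEND_PROFILER_OUTPUT/trace_view.json location mentioned elsewhere in this guide; the function names are hypothetical.

```python
from pathlib import Path


def find_profile_dirs(output_dir):
    """Return every *_ascend_pt directory under output_dir, at any depth."""
    return sorted(p for p in Path(output_dir).rglob("*_ascend_pt") if p.is_dir())


def find_trace_views(output_dir):
    """Return the trace_view.json files of profile directories that were parsed."""
    views = []
    for profile_dir in find_profile_dirs(output_dir):
        candidate = profile_dir / "ASCEND_PROFILER_OUTPUT" / "trace_view.json"
        if candidate.is_file():
            views.append(candidate)
    return views
```

For example, `find_trace_views("./sglang_profile")` covers both the flat layout and the extra timestamp subdirectory that Method C adds.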
After profiling stops (either /stop_profile returns or num_steps
auto-triggers), the server automatically parses the raw data. The
ASCEND_PROFILER_OUTPUT directory directly contains the following visualization
files — no need to manually call analyse():
If you need to re-parse existing data with different parameters, or if
profiling was interrupted and ASCEND_PROFILER_OUTPUT was not auto-generated,
use torch_npu's analyse() tool:
from torch_npu.profiler.profiler import analyse
analyse("./sglang_profile/<hostname>_*_ascend_pt/")
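To decide which runs need re-parsing, you can scan for profile directories that lack ASCEND_PROFILER_OUTPUT. This is a sketch under stated assumptions: the helper names `unparsed_profile_dirs` and `reparse` are hypothetical, and `reparse` passes each `*_ascend_pt` path to analyse() in the same form as the call above, which you should verify against your torch_npu version.

```python
from pathlib import Path


def unparsed_profile_dirs(output_dir):
    """Return *_ascend_pt directories that have no ASCEND_PROFILER_OUTPUT folder."""
    return sorted(
        d
        for d in Path(output_dir).rglob("*_ascend_pt")
        if d.is_dir() and not (d / "ASCEND_PROFILER_OUTPUT").is_dir()
    )


def reparse(output_dir):
    """Run analyse() on every unparsed directory (needs torch_npu installed)."""
    # Imported lazily so the scan above also works on machines without torch_npu.
    from torch_npu.profiler.profiler import analyse

    for profile_dir in unparsed_profile_dirs(output_dir):
        # Assumption: analyse() accepts the *_ascend_pt path, as in the call above.
        analyse(str(profile_dir))
```

Calling `unparsed_profile_dirs("./sglang_profile")` on its own is a cheap check before starting a potentially long parse.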
Check the server log for Profiling starts. Traces will be saved to: <path> and
Profiling done. Traces are saved to: <path>, or the sglang.profiler output for
Dump profiling traces to <path>.
Reduce --num-prompts and --random-output-len to avoid trace files too large
for browsers.
Set start_step to skip the first few warmup steps and capture performance data
under steady state.
Large values of num_steps or --profile-steps can lead to lengthy profiling
data parsing times. Reduce these values appropriately when you only need a
quick overview.
Optionally pass --disable-cuda-graph when starting the server. Note that this
reduces decode performance, so only use it during profiling. To analyze CUDA
Graph capture specifically, use --enable-profile-cuda-graph; traces are saved
to SGLANG_TORCH_PROFILER_DIR/graph_capture_profile/.
The merge_profiles feature has limited support, so check
*_ascend_pt/ASCEND_PROFILER_OUTPUT/trace_view.json on each node individually.
In PD disaggregation mode, prefill and decode workers must be profiled
separately; see
Profile In PD Disaggregation Mode.