docs/developer_guide/msprobe_debugging_guide.md
MSProbe is a debugging tool for AI models that diagnoses accuracy anomalies and numerical errors during model training and inference. It captures and monitors intermediate data (feature maps, weights, activations, layer outputs) and contextual metadata (prompts, tensor dtypes, hardware configuration), and supports visual analysis to systematically trace the root cause of accuracy degradation or numerical errors (e.g., NaN/Inf, output drift, mismatched predictions).
MSProbe supports three accuracy levels for data dumping, each for different debugging needs:
construct.json (for network structure
reconstruction in visualization). Requires passing a model/submodule handle.Install MSProbe with pip:
pip install mindstudio-probe --pre
MSProbe uses a JSON configuration file for customized data dumping. All core parameters are listed in the table below, with the default JSON configuration provided for reference.
| Field | Description | Required |
|---|---|---|
task | Type of dump task. Common PyTorch values include "statistics" and "tensor". A statistics task collects tensor statistics (mean, variance, max, min, etc.) while a tensor task captures arbitrary tensors. | Yes |
dump_path | Directory where dump results are stored. When omitted, MSProbe uses its default path. | No |
rank | Ranks to sample. An empty list collects every rank. For single-card tasks you must set this field to []. | No |
step | Token iteration(s) to sample. An empty list means every iteration. | No |
level | Dump level string ("L0", "L1", or "mix"). L0 targets nn.Module, L1 targets torch.api, and mix collects both. | Yes |
async_dump | Whether to enable asynchronous dump (supported for PyTorch statistics/tensor tasks). Defaults to false. | No |
scope | Customize the scope of dump. Provide two module or API names that follow the tool's naming convention to lock a range, only data between the two names will be dumped. An empty list dumps every module or torch API. |
Examples:
"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]
The level setting determines what can be provided—modules when level=L0, APIs when level=L1, and either modules or APIs when level=mix. | No |
| list | Customize dump list, only dumps elements from the list. An empty list dumps every module or torch API. Options include:
Supply the full names of specific APIs in PyTorch pynative scenarios to only dump those APIs. Example: "list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"].
When level=mix, you can provide module names so that the dump expands to everything produced while the module is running. Example: "list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"].
Provide a substring such as "list": ["relu"] to dump every API whose name contains the substring. When level=mix, modules whose names contain the substring are also expanded. | No |
{
"task": "statistics",
"dump_path": "./dump_path",
"rank": [],
"step": [],
"level": "L1",
"async_dump": false,
"statistics": {
"scope": [],
"list": [],
"data_mode": [
"all"
],
"summary_mode": "statistics"
},
"tensor": {
"scope": [],
"list": [],
"data_mode": [
"all"
],
"file_format": "npy"
},
"acc_check": {
"white_list": [],
"black_list": [],
"error_data_path": "./"
}
}
Dump files are written into dump_path you defined. They usually contain:
dump.json, which records metadata such as dtype, shape, min, max, mean, L2 norm, and requires_grad.construct.json, hierarchical structure description, when level is L0 or mix (required for visualization), its
content is not empty.stack.json, record the call stack information of API/Module.dump_tensor_data, generated when task is tensor and save the collected tensor data.See dump directory description for details.
Note: When MSProbe is enabled, cuda graph is disabled(disable_cuda_graph=True) because MSProbe only supports dump in eager mode, warmup is disabled(skip_server_warmup=True) because there is no need to dump data for this stage.
MSProbe’s full debugging workflow follows Enable → Collect Data → Visualize → Analyze Root Cause. Below is a common E2E example for SGLang-based model inference debugging.
Suitable for targeted debugging (e.g., only collect statistics data for specific ranks/steps, enable mix level for graph reconstruction + numerical comparison) and root cause analysis via problem vs. benchmark comparison.
Create msprobe-config.json (dump statistics data for rank0/1, step0/1, mix level):
{
"task": "statistics",
"dump_path": "./problem_dump",
"rank": [
0,
1
],
"step": [
0,
1
],
"level": "mix",
"async_dump": false,
"statistics": {
"scope": [],
"list": [],
"data_mode": [
"all"
],
"summary_mode": "statistics"
}
}
Launch the SGLang server and specify the configuration file path with --msprobe-dump-config:
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2.5-0.5B-Instruct \
--host 127.0.0.1 \
--port 1027 \
--msprobe-dump-config /home/msprobe-config.json
Send normal inference requests to trigger model running (MSProbe automatically collects data during request processing):
curl -H "Content-type: application/json" \
-X POST \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"messages": [
{
"role": "user",
"content": "Hello, my name is"
}
],
"max_tokens": 10
}' \
http://127.0.0.1:1027/v1/chat/completions
./problem_dump../bench_dump.Key Requirement: Problem and benchmark dumps must use the same inputs and sampling points (rank/step) for valid comparison.
Dump files are saved to ./problem_dump and ./bench_dump you defined and include core files for subsequent analysis:
dump.json: Records tensor metadata of APIs and modules (dtype, shape, min/max/mean, L2 norm, requires_grad, etc.).stack.json: Logs call stack information of APIs and modules.construct.json: hierarchical structure description, required for visualization, its content is not empty.Generate a multi-rank comparison visualization file (mix level generates construct.json for graph reconstruction):
msprobe graph_visualize -tp ./problem_dump/step0 -gp ./bench_dump/step0 -o ./graph_output
-tp: Path to problem-side dump data-gp: Path to benchmark-side dump data-o: Output directory for visualization filesIf you want overflow check (for NaN/Inf detection), please specify the parameter -oc
msprobe graph_visualize -tp ./problem_dump/step0 -gp ./bench_dump/step0 -o ./graph_output -oc
After the comparison or build task finishes, a compare_{timestamp}.vis.db file is created under graph_output.
Start TensorBoard:
tensorboard --logdir ./graph_output --bind_all --port 6006
Root Cause Analysis in TensorBoard:
After locating the divergent node (e.g., a specific Conv layer or torch API with abnormal tensor values), verify by:
scope/list in the configuration file) for fine-grained data collection.pip show mindstudio_probe to troubleshoot. If it is installed, the MSProbe
version information will be printed. If it is confirmed that it has not been installed, please
use pip install mindstudio-probe --pre for installation;--msprobe-dump-config parameter points to the correct JSON file path.task: "statistics" instead of "tensor" to collect only tensor statistics (avoids raw tensor dump);scope field (specify start/end module/API);list field (only dump specific modules/APIs or substrings);rank and step (avoid dumping all ranks/iterations).construct.json is not empty (requires level: L0 or mix – L1 does not generate graph files);-tp (problem dump) and -gp (benchmark dump) paths point to valid rank/step subdirectories (
e.g., srep0/rank0);pip install mindstudio-probe --pre --upgrade);--logdir parameter points to the directory containing .vis.db files (not
the file itself).step range (check more token iterations for late-stage divergence);task: "tensor" (statistics may mask subtle numerical differences in raw tensor data);manual mapping feature in TensorBoard (automatic mapping may miss some nodes for custom models).├── problem_dump or bench_dump
│ ├── step0
│ │ ├── rank0
│ │ │ ├── dump_tensor_data
│ │ │ │ ├── Tensor.permute.1.forward.pt
│ │ │ │ ├── Functional.linear.5.backward.output.pt # Format: {api_type}.{api_name}.{call_count}.{forward/backward}.{input/output}.{arg_index}.
│ │ │ │ │ # arg_index is the nth input or output of the API. If an input is a list, keep numbering with decimals (e.g., 1.1 is the first element of the first argument).
│ │ │ │ ├── Module.conv1.Conv2d.forward.0.input.0.pt # Format: {Module}.{module_name}.{class_name}.{forward/backward}.{call_count}.{input/output}.{arg_index}.
│ │ │ │ ├── Module.conv1.Conv2d.forward.0.parameters.bias.pt # Module parameter data: {Module}.{module_name}.{class_name}.forward.{call_count}.parameters.{parameter_name}.
│ │ │ │ └── Module.conv1.Conv2d.parameters_grad.weight.pt # Module parameter gradients: {Module}.{module_name}.{class_name}.parameters_grad.{parameter_name}. Gradients do not include call_count because the same gradient updates all invocations.
│ │ │ │ # When the `model` argument passed to dump is a List[torch.nn.Module] or Tuple[torch.nn.Module], module-level data names also include the index inside the list ({Module}.{index}.*), e.g., Module.0.conv1.Conv2d.forward.0.input.0.pt.
│ │ │ ├── dump.json
│ │ │ ├── stack.json
│ │ │ ├── dump_error_info.log
│ │ │ └── construct.json
│ │ ├── rank1
│ │ │ ├── dump_tensor_data
│ │ │ │ └── ...
│ │ │ ├── dump.json
│ │ │ ├── stack.json
│ │ │ ├── dump_error_info.log
│ │ │ └── construct.json
│ │ ├── ...
│ │ │
│ │ └── rank7
│ ├── step1
│ │ ├── ...
│ ├── step2
rank: Device ID. Each card writes its data to the corresponding rank{ID} directory. In non-distributed scenarios
the directory is simply named rank.dump_tensor_data: Save the collected tensor data.dump.json: Statistics for the forward data of each API or module, including names, dtype, shape, max, min, mean, L2
norm (square root of the L2 variance), and CRC-32 when summary_mode="md5".
See dump.json file description for details.dump_error_info.log: Present only when the dump tool encountered an error and records the failure log.stack.json: Call stacks for APIs/modules.construct.json: Hierarchical structure description. Empty when level=L1.An L0 dump.json contains forward/backward I/O for modules together with parameters and parameter gradients. Using
PyTorch's Conv2d as an example, the network code looks like:
output = self.conv2(input) # self.conv2 = torch.nn.Conv2d(64, 128, 5, padding=2, bias=True)
dump.json contains the following entries:
Module.conv2.Conv2d.forward.0: Forward data of the module. input_args represents positional inputs, input_kwargs
represents keyword inputs, output stores forward outputs, and parameters stores weights/biases.Module.conv2.Conv2d.parameters_grad: Parameter gradients (weight and bias).Module.conv2.Conv2d.backward.0: Backward data of the module. input represents gradients that flow into the
module (gradients of the forward outputs) and output represents gradients that flow out (gradients of the module
inputs).Note: When the model parameter passed to the dump API is List[torch.nn.Module] or Tuple[torch.nn.Module],
module-level names include the index inside the list ({Module}.{index}.*). Example: Module.0.conv1.Conv2d.forward.0.
{
"task": "tensor",
"level": "L0",
"framework": "pytorch",
"dump_data_dir": "/dump/path",
"data": {
"Module.conv2.Conv2d.forward.0": {
"input_args": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
8,
16,
14,
14
],
"Max": 1.638758659362793,
"Min": 0.0,
"Mean": 0.2544615864753723,
"Norm": 70.50277709960938,
"requires_grad": true,
"data_name": "Module.conv2.Conv2d.forward.0.input.0.pt"
}
],
"input_kwargs": {},
"output": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
8,
32,
10,
10
],
"Max": 1.6815717220306396,
"Min": -1.5120246410369873,
"Mean": -0.025344856083393097,
"Norm": 149.65576171875,
"requires_grad": true,
"data_name": "Module.conv2.Conv2d.forward.0.output.0.pt"
}
],
"parameters": {
"weight": {
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
32,
16,
5,
5
],
"Max": 0.05992485210299492,
"Min": -0.05999220535159111,
"Mean": -0.0006165213999338448,
"Norm": 3.421217441558838,
"requires_grad": true,
"data_name": "Module.conv2.Conv2d.forward.0.parameters.weight.pt"
},
"bias": {
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
32
],
"Max": 0.05744686722755432,
"Min": -0.04894155263900757,
"Mean": 0.006410328671336174,
"Norm": 0.17263513803482056,
"requires_grad": true,
"data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt"
}
}
},
"Module.conv2.Conv2d.parameters_grad": {
"weight": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
32,
16,
5,
5
],
"Max": 0.018550323322415352,
"Min": -0.008627401664853096,
"Mean": 0.0006675920449197292,
"Norm": 0.26084786653518677,
"requires_grad": false,
"data_name": "Module.conv2.Conv2d.parameters_grad.weight.pt"
}
],
"bias": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
32
],
"Max": 0.014914230443537235,
"Min": -0.006656786892563105,
"Mean": 0.002657240955159068,
"Norm": 0.029451673850417137,
"requires_grad": false,
"data_name": "Module.conv2.Conv2d.parameters_grad.bias.pt"
}
]
},
"Module.conv2.Conv2d.backward.0": {
"input": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
8,
32,
10,
10
],
"Max": 0.0015069986693561077,
"Min": -0.001139344065450132,
"Mean": 3.3215508210560074e-06,
"Norm": 0.020567523315548897,
"requires_grad": false,
"data_name": "Module.conv2.Conv2d.backward.0.input.0.pt"
}
],
"output": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
8,
16,
14,
14
],
"Max": 0.0007466732058674097,
"Min": -0.00044813455315306783,
"Mean": 6.814070275140693e-06,
"Norm": 0.01474067009985447,
"requires_grad": false,
"data_name": "Module.conv2.Conv2d.backward.0.output.0.pt"
}
]
}
}
}
An L1 dump.json records forward/backward I/O for APIs. Using PyTorch's relu function as an
example (output = torch.nn.functional.relu(input)), the file contains:
Functional.relu.0.forward: Forward data of the API. input_args are positional inputs, input_kwargs are keyword
inputs, and output stores the forward outputs.Functional.relu.0.backward: Backward data of the API. input represents the gradients of the forward outputs,
and output represents the gradients that flow back to the forward inputs.{
"task": "tensor",
"level": "L1",
"framework": "pytorch",
"dump_data_dir": "/dump/path",
"data": {
"Functional.relu.0.forward": {
"input_args": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
32,
16,
28,
28
],
"Max": 1.3864083290100098,
"Min": -1.3364859819412231,
"Mean": 0.03711778670549393,
"Norm": 236.20692443847656,
"requires_grad": true,
"data_name": "Functional.relu.0.forward.input.0.pt"
}
],
"input_kwargs": {},
"output": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
32,
16,
28,
28
],
"Max": 1.3864083290100098,
"Min": 0.0,
"Mean": 0.16849493980407715,
"Norm": 175.23345947265625,
"requires_grad": true,
"data_name": "Functional.relu.0.forward.output.0.pt"
}
]
},
"Functional.relu.0.backward": {
"input": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
32,
16,
28,
28
],
"Max": 0.0001815402356442064,
"Min": -0.00013352684618439525,
"Mean": 0.00011915402356442064,
"Norm": 0.007598237134516239,
"requires_grad": false,
"data_name": "Functional.relu.0.backward.input.0.pt"
}
],
"output": [
{
"type": "torch.Tensor",
"dtype": "torch.float32",
"shape": [
32,
16,
28,
28
],
"Max": 0.0001815402356442064,
"Min": -0.00012117840378778055,
"Mean": 2.0098118724831693e-08,
"Norm": 0.006532244384288788,
"requires_grad": false,
"data_name": "Functional.relu.0.backward.output.0.pt"
}
]
}
}
}
A mix dump.json contains both L0 and L1 level data; the file format is the same as the examples above.