optional-skills/mlops/slime/references/troubleshooting.md
Symptoms: Inference engine dies mid-training, connection errors
Solutions:
--use-fault-tolerance
--sglang-mem-fraction-static 0.85 # Increase from 0.8
--rollout-batch-size 16 # Reduce from 32
--sglang-disable-cuda-graph
Symptoms: Some SGLang engines overloaded while others idle
Solutions:
--sglang-router-strategy round_robin
--rollout-num-gpus-per-engine 1 # More engines, less GPUs each
Symptoms: Training hangs after rollout, timeout errors
Solutions:
--update-weights-interval 5 # Increase from 2
--colocate
# Verify InfiniBand is enabled
ibstat
Symptoms: Nodes fail to receive updated weights
Solutions:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0
export NCCL_TIMEOUT=1800
Symptoms: CUDA OOM in backward pass
Solutions:
--recompute-activations
--micro-batch-size 1
--sequence-parallel
--global-batch-size 128 # Reduce from 256
Symptoms: OOM when both training and inference run on same GPUs
Solutions:
--sglang-mem-fraction-static 0.4 # Reduce from 0.8
--offload-optimizer-states
--seq-length 2048 # Reduce from 4096
Symptoms: GPU idle during data fetch, low GPU utilization
Solutions:
--num-data-workers 4
--streaming-data
# Pre-process data offline
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model_path")
# Save tokenized data
Symptoms: KeyError, missing fields, parsing failures
Solutions:
import json
with open("data.jsonl") as f:
for line in f:
data = json.loads(line)
assert "prompt" in data, "Missing prompt field"
assert "label" in data, "Missing label field"
--input-key prompt # Must match your data
--label-key label # Must match your data
Symptoms: Loss becomes NaN or explodes
Solutions:
--lr 1e-6 # Reduce from 5e-6
--clip-grad 1.0
# Verify no empty prompts or responses
for sample in dataset:
assert len(sample["prompt"]) > 0
--bf16 # More numerically stable
Symptoms: Reward drops to zero, model outputs garbage
Solutions:
--kl-loss-coef 0.01 # Increase from 0.001
--n-samples-per-prompt 4 # Reduce from 8
# Test reward function independently
from custom_rm import reward_func
sample = Sample(prompt="test", response="test response")
reward = reward_func(args, sample)
print(f"Reward: {reward}") # Should be reasonable
Symptoms: Error when using --colocate with train_async.py
Solution: Colocated mode is NOT supported for async training. Use separate GPUs:
# Remove --colocate flag
python train_async.py \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4 \
# No --colocate
Symptoms: Policy divergence, inconsistent behavior
Solutions:
--async-buffer-size 2 # Reduce from 4
--update-weights-interval 1 # Sync every rollout
Symptoms: Model learns to output tool responses verbatim
Solution: Properly set loss mask in custom generate function:
def build_loss_mask(sample):
"""Create loss mask that excludes tool responses."""
mask = []
for i, token in enumerate(sample.tokens):
if is_tool_response(token, sample.metadata):
mask.append(0) # Don't compute loss
else:
mask.append(1) # Compute loss
return mask
Symptoms: OOM or truncation in multi-turn conversations
Solutions:
# In custom generate function
conversation = sample.prompt[-10:] # Keep last 10 turns
--sglang-context-length 16384
Symptoms: Cannot load saved checkpoint
Solutions:
ls -la /path/to/checkpoint/
# Checkpoint was saved with TP=2, must load with TP=2
--tensor-model-parallel-size 2
python tools/convert_hf_to_megatron.py \
--hf_model_path /path/to/hf/model \
--save_path /path/to/megatron/checkpoint
--log-level DEBUG
export SLIME_DEBUG=1
watch -n 1 nvidia-smi
tensorboard --logdir outputs/
# Test reward function
import asyncio
from custom_rm import reward_func
async def test():
sample = Sample(prompt="test", response="test", label="expected")
reward = await reward_func(args, sample)
print(f"Reward: {reward}")
asyncio.run(test())
Key constraint to remember:
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
Example: 32 × 8 = 256 × 1
examples/ directory