examples/profile/README.md
End-to-end GRPO runs that enable one of verl's profilers so you can capture a performance/memory trace without authoring a bespoke launcher. All scripts use the current verl.trainer.main_ppo entry point and the current Hydra API.
| Script | Profiler | Model | Infer | Train | Platform |
|---|---|---|---|---|---|
run_qwen3_8b_npu_profile_e2e.sh | NPU (E2E) | Qwen3-8B | vLLM | FSDP | NPU |
run_qwen3_8b_npu_profile_discrete.sh | NPU (discrete) | Qwen3-8B | vLLM | FSDP | NPU |
run_qwen2_5_vl_7b_torch_memory.sh | torch_memory | Qwen2.5-VL-7B | SGLang | FSDP | NVIDIA |
*_profile_e2e.sh — one end-to-end timeline for all ranks.*_profile_discrete.sh — per-stage (rollout/ref/actor) discrete traces.Controlled via global_profiler.tool=npu, global_profiler.steps=[...], global_profiler.save_path=..., plus per-role actor_rollout_ref.*.profiler.* overrides. Override any of PROFILE_STEPS, PROFILE_SAVE_PATH, PROFILE_LEVEL, PROFILE_CONTENTS, PROFILE_DISCRETE, PROFILE_RANKS_ALL to adjust behavior.
run_qwen2_5_vl_7b_torch_memory.sh dumps torch.cuda._record_memory_history snapshots to global_profiler.save_path (default ./mem_snapshots). Load the .pickle in PyTorch's memory viz UI. Override TRACE_ALLOC_MAX_ENTRIES, STACK_DEPTH, PROFILE_SAVE_PATH as needed.VAR=${VAR:-default} for MODEL_PATH, batch sizes, learning rate, rollout TP, profile options, etc.trainer.balance_batch=True are enabled by default.ppo_megatron_trainer.yaml, ppo_micro_batch_size, data.val_batch_size, top-level reward_model.*, actor.ulysses_sequence_parallel_size).