Scalene-Agents.md
python3 -m scalene --cli --json --outfile profile.json script.py
Scalene's key differentiator is separating Python time from C/native time:
- n_cpu_percent_python: Time spent executing Python bytecode - this is your optimization target
- n_cpu_percent_c: Time spent in C extensions/native code - generally NOT optimizable at the Python level

Critical insight: Focus optimization efforts on code with high Python time. Code spending most of its time in C is already running native code and can only be improved by:
For example, Python's slice reversal perm[:k+1] = perm[k::-1] shows high C time because it's implemented in C. Replacing it with a Python loop makes performance worse because it moves work from fast C code to slow Python code.
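The slice-reversal example can be checked directly. The sketch below is a minimal illustration and is not taken from Scalene itself: the helper names `reverse_prefix_slice` and `reverse_prefix_loop` are hypothetical, and the timings depend on the machine and Python version, so no specific numbers are claimed.

```python
import timeit


def reverse_prefix_slice(perm, k):
    """Reverse perm[0..k] in place using slice assignment (work done in C)."""
    perm[:k + 1] = perm[k::-1]
    return perm


def reverse_prefix_loop(perm, k):
    """Reverse perm[0..k] in place with an explicit Python-level swap loop."""
    i, j = 0, k
    while i < j:
        perm[i], perm[j] = perm[j], perm[i]
        i += 1
        j -= 1
    return perm


if __name__ == "__main__":
    data = list(range(10_000))
    k = len(data) - 1
    for fn in (reverse_prefix_slice, reverse_prefix_loop):
        # Each call reverses the same prefix again; only the relative cost matters.
        elapsed = timeit.timeit(lambda: fn(data, k), number=200)
        print(f"{fn.__name__}: {elapsed:.4f} s for 200 calls")
```

Both functions produce the same list; profiling each under Scalene should attribute the slice version mostly to C time and the loop version mostly to Python time.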
Scalene tracks detailed memory behavior:
- n_malloc_mb: Memory allocated on this line
- n_peak_mb: Peak memory usage attributed to this line
- n_avg_mb: Average memory footprint
- n_growth_mb: Net memory growth (allocations minus frees) - useful for detecting memory leaks
- n_usage_fraction: Fraction of total memory used by this line

When to focus on memory:
- High n_growth_mb with n_python_fraction near 1.0 indicates Python objects accumulating (potential leak)
- High n_peak_mb suggests opportunities to reduce memory footprint by processing data in chunks

n_copy_mb_s: Rate of memory copying in MB/s attributed to this line

Why this matters: High copy volume indicates inefficient data handling:
- … join() or io.StringIO)

Example: A line showing 100+ MB/s copy volume in a data processing loop suggests refactoring to avoid intermediate copies.
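The memory and copy-volume guidance above can be applied mechanically to per-line records. The following is a minimal sketch under stated assumptions: each record is assumed to be a dict that carries the metric names listed above plus a `lineno` key, the function name `flag_memory_lines` is hypothetical, and the thresholds are illustrative defaults (only the 100 MB/s figure comes from the example in the text).

```python
def flag_memory_lines(lines, growth_mb=10.0, python_fraction=0.9, copy_mb_s=100.0):
    """Return (lineno, reason) pairs for lines that deserve a closer look.

    `lines` is an iterable of per-line dicts; missing metrics count as 0.
    All thresholds are illustrative, not values recommended by Scalene.
    """
    findings = []
    for rec in lines:
        lineno = rec.get("lineno")
        growing = rec.get("n_growth_mb", 0.0) >= growth_mb
        mostly_python = rec.get("n_python_fraction", 0.0) >= python_fraction
        if growing and mostly_python:
            findings.append((lineno, "possible leak: Python objects accumulating"))
        if rec.get("n_copy_mb_s", 0.0) >= copy_mb_s:
            findings.append((lineno, "high copy volume"))
    return findings


# Synthetic records for illustration only.
sample = [
    {"lineno": 12, "n_growth_mb": 48.0, "n_python_fraction": 0.98, "n_copy_mb_s": 3.0},
    {"lineno": 30, "n_growth_mb": 0.5, "n_python_fraction": 0.20, "n_copy_mb_s": 140.0},
    {"lineno": 41, "n_growth_mb": 0.0, "n_python_fraction": 0.00, "n_copy_mb_s": 0.0},
]
for lineno, reason in flag_memory_lines(sample):
    print(f"line {lineno}: {reason}")
```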
- n_gpu_percent: Percentage of GPU time
- n_gpu_avg_memory_mb: Average GPU memory usage
- n_gpu_peak_memory_mb: Peak GPU memory usage

n_sys_percent: Time spent in system calls (I/O, etc.)

High system time may indicate:
When run with --stacks, Scalene records three top-level stack views in the JSON profile. They share the same CPU samples but expose different slices of each one:
- stacks: Python-only call chains, filtered to user-traceable frames.
- native_stacks: C/C++ frames from the interrupted thread, captured by Scalene's signal-handler unwinder. Each entry is a list of [module, symbol, ip, offset] frames (innermost-first), trimmed to drop Scalene's own handler frames at the leaf and CPython interpreter / process-entry frames at the root.
- combined_stacks: Stitched Python + native chains for the same sample. Each frame is a structured dict so the seam is explicit:

{"kind": "py", "display_name": "hot", "filename_or_module": "/app/work.py", "line": 42, "ip": null, "offset": null}
{"kind": "native", "display_name": "cblas_dgemm", "filename_or_module": "/lib/libBLAS.so", "line": null, "ip": 140735, "offset": 32}
Frames are stored outermost-first (caller → callee). The Python segment runs from the program entry point down through user functions; the native segment picks up where Python called into C and runs to the actual interrupted leaf. At the seam, the Python segment ends with the deepest user-traceable Python frame and the native segment starts with the first native frame outside CPython's interpreter loop.
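To make the frame layout concrete, here is a small sketch that locates the seam in one stitched stack and renders it as a single line. It relies only on the frame fields shown in the two example dicts above; the helper names (`find_seam`, `format_frame`, `format_stack`) and the output format are hypothetical choices for illustration.

```python
def find_seam(frames):
    """Return the index of the first native frame, or None for a pure-Python stack."""
    for i, frame in enumerate(frames):
        if frame["kind"] == "native":
            return i
    return None


def format_frame(frame):
    """Render one frame dict; unresolved names fall back to placeholders."""
    name = frame["display_name"] or "<unresolved>"
    where = frame["filename_or_module"] or "?"
    if frame["kind"] == "py":
        return f"{name} ({where}:{frame['line']})"
    return f"{name} [{where}, offset {frame['offset']}]"


def format_stack(frames):
    """Join frames in stored order (outermost-first, caller -> callee)."""
    return " -> ".join(format_frame(f) for f in frames)


# The two example frames from above, with JSON null mapped to Python None.
frames = [
    {"kind": "py", "display_name": "hot", "filename_or_module": "/app/work.py",
     "line": 42, "ip": None, "offset": None},
    {"kind": "native", "display_name": "cblas_dgemm",
     "filename_or_module": "/lib/libBLAS.so", "line": None, "ip": 140735, "offset": 32},
]
print(find_seam(frames))
print(format_stack(frames))
```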
How to read it:
- A stack with only py frames means the sample landed in pure Python: no native code was running at the moment of interrupt.
- A native frame from a known library (numpy, BLAS, lxml, etc.) means time was actually spent in that library's C code, called from the listed Python frame. This is information stacks alone cannot show, because the Python eval loop has already returned by the time the Python signal handler runs.
- combined_stacks entries with the same Python prefix and different native leaves are normal: the same Python call site routes work into different C functions.

When combined_stacks adds information beyond n_cpu_percent_c: the per-line n_cpu_percent_c tells you how much of a line's time is in C. combined_stacks tells you which C function. If a line shows 85% C time, the stitched stack is what shows whether it's BLAS, regex, JSON parsing, or something else.
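One way to answer "which C function" is to group stitched stacks by their deepest Python frame and count the native leaves under it. The sketch below makes an assumption that the text does not confirm: each combined_stacks entry is treated as a `(frames, hit_count)` pair. The function name `native_leaves_by_anchor` is hypothetical; adjust the unpacking to match the actual JSON.

```python
from collections import Counter


def native_leaves_by_anchor(combined_stacks):
    """Map (file, line) of the deepest Python frame to a Counter of leaf labels.

    ASSUMPTION: each entry is a (frames, hit_count) pair, with frames stored
    outermost-first as described above.
    """
    result = {}
    for frames, hits in combined_stacks:
        py_frames = [f for f in frames if f["kind"] == "py"]
        if not py_frames:
            continue  # no Python anchor to attribute this stack to
        anchor = py_frames[-1]
        key = (anchor["filename_or_module"], anchor["line"])
        leaf = frames[-1]
        if leaf["kind"] == "native":
            # Unresolved frames have an empty name, so fall back to the address.
            label = leaf["display_name"] or hex(leaf["ip"])
        else:
            label = "<pure Python>"
        result.setdefault(key, Counter())[label] += hits
    return result


# Synthetic data for illustration only.
hot = {"kind": "py", "display_name": "hot", "filename_or_module": "/app/work.py",
       "line": 42, "ip": None, "offset": None}
dgemm = {"kind": "native", "display_name": "cblas_dgemm",
         "filename_or_module": "/lib/libBLAS.so", "line": None, "ip": 140735, "offset": 32}
example = [([hot, dgemm], 7), ([hot], 1)]
for anchor, leaves in native_leaves_by_anchor(example).items():
    print(anchor, dict(leaves))
```

Because the per-stack breakdown is described as approximate, the counts are better read as a ranking of native leaves than as exact shares.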
Caveats:
- combined_stacks is best-effort: if multiple native stacks are drained for one Python sample, each is attached to the same Python anchor (v1 policy). Hit counts are reliable; the per-stack breakdown is approximate.
- Frames that cannot be resolved by dladdr show empty display_name / filename_or_module and a non-zero ip.
- … combined_stacks will be empty.

1. Check Python vs C time split
2. Check memory growth
3. Check copy volume
4. Check system time
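The four checks above can be run in order over every line of a profile. This is a minimal sketch, not Scalene's own tooling: it assumes the JSON (as loaded with `json.load`) has the layout `profile["files"][filename]["lines"]`, a list of per-line dicts carrying the metric names used in this document. The names `iter_line_records` and `triage` and all thresholds are hypothetical.

```python
def iter_line_records(profile):
    """Yield (filename, record) pairs. ASSUMED layout: files -> name -> lines."""
    for filename, info in profile.get("files", {}).items():
        for rec in info.get("lines", []):
            yield filename, rec


def triage(rec, growth_mb=10.0, python_fraction=0.9, copy_mb_s=100.0, sys_percent=10.0):
    """Apply the four checks in order; thresholds are illustrative only."""
    notes = []
    py = rec.get("n_cpu_percent_python", 0.0)
    c = rec.get("n_cpu_percent_c", 0.0)
    # 1. Python vs C time split
    if py > c:
        notes.append("mostly Python time: optimization target")
    elif c > 0:
        notes.append("mostly C time: little to gain at the Python level")
    # 2. Memory growth
    if (rec.get("n_growth_mb", 0.0) >= growth_mb
            and rec.get("n_python_fraction", 0.0) >= python_fraction):
        notes.append("memory growth in Python objects: potential leak")
    # 3. Copy volume
    if rec.get("n_copy_mb_s", 0.0) >= copy_mb_s:
        notes.append("high copy volume")
    # 4. System time
    if rec.get("n_sys_percent", 0.0) >= sys_percent:
        notes.append("high system time")
    return notes


# Synthetic profile for illustration; a real one would come from json.load().
profile = {"files": {"script.py": {"lines": [
    {"lineno": 3, "n_cpu_percent_python": 40.0, "n_cpu_percent_c": 2.0},
    {"lineno": 9, "n_cpu_percent_python": 1.0, "n_cpu_percent_c": 30.0,
     "n_copy_mb_s": 150.0, "n_sys_percent": 12.0},
]}}}
for filename, rec in iter_line_records(profile):
    for note in triage(rec):
        print(f"{filename}:{rec['lineno']}: {note}")
```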