Scalene Profiling Guide

Basic Usage

bash

python3 -m scalene --cli --json --outfile profile.json script.py

Understanding Python vs C (Native) Time

Scalene's key differentiator is separating Python time from C/native time:

n_cpu_percent_python: Time spent executing Python bytecode - this is your optimization target
n_cpu_percent_c: Time spent in C extensions/native code - generally NOT optimizable at the Python level

Critical insight: Focus optimization efforts on code with high Python time. Code spending most of its time in C is already running native code and can only be improved by:

Reducing the number of calls to the C code (algorithmic changes)
Using the library more efficiently (e.g., batching operations)
Switching to a different library

For example, Python's slice reversal perm[:k+1] = perm[k::-1] shows high C time because it's implemented in C. Replacing it with a Python loop makes performance worse because it moves work from fast C code to slow Python code.

Memory Profiling Metrics

Scalene tracks detailed memory behavior:

n_malloc_mb: Memory allocated on this line
n_peak_mb: Peak memory usage attributed to this line
n_avg_mb: Average memory footprint
n_growth_mb: Net memory growth (allocations minus frees) - useful for detecting memory leaks
n_usage_fraction: Fraction of total memory used by this line

When to focus on memory:

High n_growth_mb with n_python_fraction near 1.0 indicates Python objects accumulating (potential leak)
High n_peak_mb suggests opportunities to reduce memory footprint by processing data in chunks

Copy Volume Tracking

n_copy_mb_s: Rate of memory copying in MB/s attributed to this line

Why this matters: High copy volume indicates inefficient data handling:

Unnecessary string concatenation in loops (use join() or io.StringIO)
Repeated array/list copies (use views, in-place operations, or pre-allocation)
Passing large objects by value when references would suffice

Example: A line showing 100+ MB/s copy volume in a data processing loop suggests refactoring to avoid intermediate copies.

GPU Metrics (if applicable)

n_gpu_percent: Percentage of GPU time
n_gpu_avg_memory_mb: Average GPU memory usage
n_gpu_peak_memory_mb: Peak GPU memory usage

System Time

n_sys_percent: Time spent in system calls (I/O, etc.)

High system time may indicate:

Excessive file I/O (batch reads/writes)
Too many small network requests (batch API calls)
Suboptimal disk access patterns

Stack-Sample Outputs (`--stacks`)

When run with --stacks, Scalene records three top-level stack views in the JSON profile. They share the same CPU samples but expose different slices of each one:

stacks — Python-only call chains, filtered to user-traceable frames.
native_stacks — C/C++ frames from the interrupted thread, captured by Scalene's signal-handler unwinder. Each entry is a list of [module, symbol, ip, offset] frames (innermost-first), trimmed to drop Scalene's own handler frames at the leaf and CPython interpreter / process-entry frames at the root.
combined_stacks — Stitched Python + native chains for the same sample. Each frame is a structured dict so the seam is explicit:

json

{"kind": "py",     "display_name": "hot",         "filename_or_module": "/app/work.py",     "line": 42, "ip": null,    "offset": null}
{"kind": "native", "display_name": "cblas_dgemm", "filename_or_module": "/lib/libBLAS.so",  "line": null, "ip": 140735, "offset": 32}

Frames are stored outermost-first (caller → callee). The Python segment runs from the program entry point down through user functions; the native segment picks up where Python called into C and runs to the actual interrupted leaf. The seam between the two ends with the deepest user-traceable Python frame and starts with the first native frame outside CPython's interpreter loop.

How to read it:

A stack with only py frames means the sample landed in pure Python — no native code was running at the moment of interrupt.
A stack ending in a native frame from a known library (numpy, BLAS, lxml, etc.) means time was actually spent in that library's C code, called from the listed Python frame. This is information stacks alone cannot show — the Python eval loop has already returned by the time the Python signal handler runs.
Multiple combined_stacks entries with the same Python prefix and different native leaves are normal: the same Python call site routes work into different C functions.

When combined_stacks adds information beyond n_cpu_percent_c: the per-line n_cpu_percent_c tells you how much of a line's time is in C. combined_stacks tells you which C function. If a line shows 85% C time, the stitched stack is what shows whether it's BLAS, regex, JSON parsing, or something else.

Caveats:

combined_stacks is best-effort: if multiple native stacks are drained for one Python sample, each is attached to the same Python anchor (v1 policy). Hit counts are reliable; the per-stack breakdown is approximate.
Native frames whose symbols couldn't be resolved by dladdr show empty display_name / filename_or_module and a non-zero ip.
Not yet collected on Windows (the native unwinder is a stub there); combined_stacks will be empty.

Optimization Decision Tree

Check Python vs C time split
- High Python time → Optimize the Python code
- High C time → Consider algorithmic changes or different libraries
Check memory growth
- Sustained growth → Look for leaks, unbounded caches, or accumulating data structures
Check copy volume
- High copy rate → Refactor to use views, in-place operations, or pre-allocation
Check system time
- High system time → Batch I/O operations, use async I/O, or optimize file access patterns

Scalene Profiling Guide

Scalene Profiling Guide

Basic Usage

Understanding Python vs C (Native) Time

Memory Profiling Metrics

Copy Volume Tracking

GPU Metrics (if applicable)

System Time

Stack-Sample Outputs (--stacks)

Optimization Decision Tree

Stack-Sample Outputs (`--stacks`)