======== CpuTrace

CpuTrace is a developer tool that measures the CPU cost of execution. It is useful when deciding between algorithms for new code and for validating performance enhancements. CpuTrace measures CPU instructions, clock cycles, branch mispredictions, cache misses and thread reschedules.

Integration into Ceph

To enable CpuTrace, build with the WITH_CPUTRACE flag:

.. code-block:: bash

./do_cmake.sh -DWITH_CPUTRACE=1

Once built with CpuTrace support, you can annotate specific functions or code regions using the provided macros and helper classes.

To enable profiling in your code, include the CpuTrace header:

.. code-block:: cpp

#include "common/cputrace.h"

Then you can mark functions for profiling using the provided helpers.

Raw counter mode

CpuTrace is using the Linux perf_event_open syscall. You can use the tool as a simple helper to get access to hardware perf counters.

.. code-block:: cpp

// I am profiling my code and want to know // how many clock cycles and how many thread switches it takes HW_ctx hw = HW_ctx_empty; HW_init(&hw, HW_PROFILE_SWI|HW_PROFILE_CYC); sample_t start, end; HW_read(&hw, &start); // my code starts // ..... // my code ends HW_read(&hw, &end); // task_switches = end.swi - start.swi; // clock_cycles = end.cyc - start.cyc; HW_clean(&hw);

By inspecting task_switches and clock_cycles the developer can learn that real clock execution time of 10ms has only 1M clock cycles, but had 2 task switches.

Aggregating samples

A single readout of execution time is usually not enough. We need more samples to get a more realistic measurement of actual execution cost.

.. code-block:: cpp

// a variable to hold my measurement static measurement_t my_code_time; sample_t start, end, elapsed; // hw initialized somewhere else HW_read(&hw, &start); // my code starts // ..... // my code ends HW_read(&hw, &end); elapsed = end - start; // add new sample to the whole measurement my_code_time.sample(elapsed);

`measurement_t`

The measurement_t type aggregates collected samples and counts the number of measurements performed.

It produces summary statistics that include:

count : total number of measurements
average : mean value across all samples
zero / non-zero split : how many measurements were exactly zero versus greater than zero (only for context switch metrics)

These statistics provide a compact and clear view of performance measurements.

measurement_t can also export results in two formats:

Ceph Formatter (for structured JSON/YAML/XML output):

.. code-block:: cpp

ceph::Formatter* jf; m->dump(jf, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output
String stream (for plain-text logging):

.. code-block:: cpp

std::stringstream ss; m->dump_to_stringstream(ss, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output std::cout << ss.str();

This makes it easy to either integrate measurements into Ceph’s structured output pipeline or dump them as human-readable text for debugging.

RAII samples

It is usually most convenient to use RAII to collect samples. With RAII, measurement begins automatically when the guard object is created and ends when it goes out of scope, so no explicit start/stop calls are required.

The hardware context (HW_ctx) must be initialized once before creating guards. After initialization, the same context can be reused across multiple measurements.

HW_guard takes two arguments:

HW_ctx* ctx Pointer to the initialized hardware context.
measurement_t* m Pointer to the measurement object where results will be stored.

Example:

.. code-block:: cpp

// variable to hold measurement results static measurement_t my_code_time; { HW_guard guard(&hw, &my_code_time); // code to be measured // ... }

Named measurements

Code regions can be measured using a named guard. Each HW_named_guard automatically starts measurement at construction and stops when leaving scope.

.. code-block:: cpp

{ HW_named_guard("function", &hw); // my code starts // ... // my code ends }

This example records the execution time of function.

The guard requires a pointer to a previously initialized HW_ctx. This context must be created and set up (e.g., during program initialization) before guards can be used.

Named guards provide a simple and consistent way to track performance metrics.

To later access the collected measurements for a given name, use:

.. code-block:: cpp

measurement_t* m = get_named_measurement("function"); if (m) { // inspect m->sum_cyc, m->sum_ins. // m->dump_to_stringstream(ss, HW_PROFILE_INS|HW_PROFILE_CYC); }

Groups of measurements

Keeps all measurements together. Allows to very easily add a CPU probe. This method is limited to measuring only scopes of execution, where RAII rules can be used.

Define your measurements group:

.. code-block:: cpp

cpucounter_group BlueStore::cputrace_bluestore("bluestore");

And put some probes:

.. code-block:: cpp

MEASURE_SCOPE(cputrace_bluestore, txc_state_proc)

The values are easily available for read and reset via admin socket commands. Unlike named measurements, probes in groups cannot be stopped and started.

Admin socket integration

In addition to direct instrumentation in code, CpuTrace can also be controlled at runtime via the admin socket interface. This allows developers to start, stop, and inspect profiling in running Ceph daemons without rebuilding or restarting them.

To profile a function, annotate it with the provided macros:

.. code-block:: cpp

HWProfileFunctionF(profile, func, HW_PROFILE_CYC | HW_PROFILE_CMISS | HW_PROFILE_INS | HW_PROFILE_BMISS | HW_PROFILE_SWI);

profile is a local variable name for the profiler object and only needs to be unique within the profiling scope.
__func__ (or any string you pass as the name) is the unique anchor name for this profiling scope.

Each unique name creates a separate anchor. Reusing the same name in multiple places will trigger an assertion failure.

This macro automatically attaches a profiler to the function scope and collects the specified hardware counters each time the function executes.

You can combine any of the available flags:

HW_PROFILE_CYC – CPU cycles
HW_PROFILE_CMISS – Cache misses
HW_PROFILE_BMISS – Branch mispredictions
HW_PROFILE_INS – Instructions retired
HW_PROFILE_SWI – Context switches

Available commands:

cputrace start – Start profiling with the configured groups/counters
cputrace stop – Stop profiling and freeze results
cputrace dump – Dump all collected metrics (as JSON or plain text)
cputrace reset – Reset all captured data

Profiling counters are cumulative. cputrace stop pauses profiling without resetting values. cputrace start resumes accumulation. Use cputrace reset to clear all collected metrics.

Example usage from the command line:

.. code-block:: bash

Start profiling on OSD.0

ceph tell osd.0 cputrace start

Stop profiling

ceph tell osd.0 cputrace stop

Dump results

ceph tell osd.0 cputrace dump

Reset counters

ceph tell osd.0 cputrace reset

These commands can be repeated multiple times: developers typically start before a workload, stop afterwards, and then dump the results to analyze them.

cputrace dump supports optional arguments to filter by logger or counter, so only a subset of metrics can be reported when needed.

cputrace reset clears all data, preparing for a fresh round of profiling.

API Reference

Enums


.. code-block:: cpp

  enum cputrace_flags {
      HW_PROFILE_SWI   = (1ULL << 0), // Context switches
      HW_PROFILE_CYC   = (1ULL << 1), // CPU cycles
      HW_PROFILE_CMISS = (1ULL << 2), // Cache misses
      HW_PROFILE_BMISS = (1ULL << 3), // Branch mispredictions
      HW_PROFILE_INS   = (1ULL << 4), // Instructions retired
  };

The bitwise ``|`` operator may be used to combine these flags.

Data structures

sample_t – holds a single hardware counter snapshot.

.. code-block:: cpp

struct sample_t { uint64_t swi; //context switches uint64_t cyc; //clock cycles uint64_t cmiss; //cache misses uint64_t bmiss; //branch misses uint64_t ins; //instructions };

measurement_t – accumulates multiple samples and computes totals/averages and other useful metrics.

.. code-block:: cpp

struct measurement_t { uint64_t call_count = 0; uint64_t sample_count = 0; uint64_t sum_swi = 0, sum_cyc = 0, sum_cmiss = 0, sum_bmiss = 0, sum_ins = 0; uint64_t non_zero_swi_count = 0; uint64_t zero_swi_count = 0; };

HW_ctx – encapsulates perf-event file descriptors for one measurement context.

.. code-block:: cpp

extern HW_ctx HW_ctx_empty;

Low-level API


- ``void HW_init(HW_ctx* ctx, cputrace_flags flags)`` – initialize perf counters.
- ``void HW_read(HW_ctx* ctx, sample_t* out)`` – read current counter values.
- ``void HW_clean(HW_ctx* ctx)`` – release perf counters.