# Adaptive Benchmark Mode
Traditional benchmarking runs each benchmark to completion, then reports the
minimum observed time. This works well on quiet systems. On noisy VMs it fails:
benchmark A runs during a slow period, B during a fast one, distorting their
relative performance.
--bm_mode=adaptive interleaves short slices of all benchmarks in random order. All
benchmarks see the same system states. It runs until results are both stable
(not oscillating) and precise (confidence interval narrow enough).
**What adaptive gives you:** fair within-run comparisons. Every benchmark experiences the same system states, and results are reported only once they are stable and precise.

**What adaptive does NOT guarantee:** run-to-run reproducibility as tight as --bm_target_precision_pct might suggest. But in practice, run-to-run results are still very close. When adaptive mode converges, results closely agree with best-of mode's "good runs" — the ones where you got lucky. Best-of mode trends slightly lower since it targets the minimum; adaptive targets a percentile (default: 33rd).
In summary: adaptive tries to take the luck out of within-run comparisons, at the cost of occasionally longer runtime. For cross-run comparisons (e.g., before/after your commit), run both builds under the same conditions and rely on relative differences, not absolute values.
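Adaptive mode runs ordinary folly benchmarks; here is a minimal sketch of a benchmark binary (the benchmark body is illustrative, while BENCHMARK, folly::doNotOptimizeAway, folly::runBenchmarks, and folly::Init are folly's real API):

```cpp
#include <numeric>
#include <vector>

#include <folly/Benchmark.h>
#include <folly/init/Init.h>

// Illustrative benchmark: the macro's second parameter is the iteration
// count the harness asks for.
BENCHMARK(vectorSum, n) {
  std::vector<int> v(1000, 1);
  long total = 0;
  for (unsigned i = 0; i < n; ++i) {
    total += std::accumulate(v.begin(), v.end(), 0L);
  }
  folly::doNotOptimizeAway(total); // keep the compiler from deleting the loop
}

int main(int argc, char** argv) {
  folly::Init init(&argc, &argv);
  folly::runBenchmarks();
}
```

Then enable adaptive mode when running: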
```
buck2 run @//mode/opt //your:benchmark -- --bm_mode=adaptive
```
With default options, this converges quickly on quiet systems. On noisy VMs, increase the per-benchmark timeout:
```
buck2 run @//mode/opt //your:benchmark -- --bm_mode=adaptive --bm_max_secs=30
```
Pass --bm_target_precision_pct to control the convergence threshold — the
measurement's 95% confidence interval must be narrower than this percentage of
the estimate (the default 0.4 means precise to 0.4ns out of 100ns). Tighten
it (e.g., 0.1) if you need finer discrimination between benchmarks, but note
that binary layout effects (see below) often dominate at that level.
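Concretely, the check this flag controls can be pictured as the following predicate (a sketch with assumed names, not folly's actual code):

```cpp
// Precise enough when the 95% CI's width is at most the requested
// percentage of the estimate; all names here are hypothetical.
bool preciseEnough(
    double estimateNs,
    double ciLowNs,
    double ciHighNs,
    double targetPrecisionPct) {
  return (ciHighNs - ciLowNs) <= estimateNs * (targetPrecisionPct / 100.0);
}
```

With the defaults, a 100ns estimate converges once its CI is no wider than 0.4ns.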
Watch out: a "converged" measurement can still be wrong — see the next section.
Add --bm_verbose for under-the-hood stats: sampling rounds,
convergence progress.
Use --bm_min_secs if you value confidence over speed; it ensures each benchmark
runs at least that long.
For all flag details, see folly/Benchmark.cpp.
Aside: All comparisons are with "best-of" mode. --bm_estimate_time computes
a genuinely different value and is slower. Its utility is unclear to the author
of adaptive mode.
System-state changes lasting longer than a run (tens of seconds) are invisible. You can have a stable, converged run that doesn't match the next one.
For cross-run comparisons (e.g., comparing your commit to baseline), run both builds back to back under the same conditions and compare relative differences, not absolute values.
Microbenchmarks are sensitive to instruction cache alignment. Innocuous code changes can shift timings by several percent — far exceeding requested accuracy.
Example: Adding this dead code to verboseLogFinal() (which never runs when
--bm_verbose is off) consistently shifted timings under
--bm_mode=adaptive --bm_max_secs=20 --bm_target_precision_pct=0.03:
```cpp
std::ostringstream rawOss;
rawOss << "\n[bm_raw_samples]";
for (const auto& s : states) {
  rawOss << "\n" << s.name;
  for (const auto& sample : s.samples) {
    rawOss << "\t" << sample.first;
  }
}
LOG(INFO) << rawOss.str();
```
This is inherent to microbenchmarking. There is no "true" runtime for a microbenchmark independent of its binary layout. Adaptive mode improves relative comparisons within a run (different benchmarks see the same system state), but absolute values can shift between builds.
Implication: Don't obsess over 0.1% accuracy targets. Binary layout effects can easily introduce 5-20% variance. Use adaptive to get stable, precise comparisons, not "perfect" absolute measurements.
Sometimes you'll see benchmarks stuck oscillating. Even if they eventually converge, the numbers are suspect — thermal throttling, noisy neighbors, background tasks.
If you see persistent [unstable] warnings, try --bm_min_secs to force longer runs that average out short-term noise.

The default --bm_target_percentile=33.3 is a heuristic, not science.
Rationale: system noise is one-sided (interruptions only ever add time), so a low percentile filters most of it out, while the extreme minimum (p0) depends on catching at least one perfectly lucky sample; the 33rd percentile is a compromise between noise rejection and repeatability.
**Risk of low percentiles:** If your code has a bimodal distribution (fast path
vs. slow path), low percentiles might hide the slow path. If you care about tail
latencies, try --bm_target_percentile=90 or higher.
**Comparison to best-of mode:** Best-of mode reports p0 (minimum across all iterations). Adaptive's p33 will trend slightly higher but is more repeatable on noisy systems. If comparing adaptive to best-of, expect a small difference on the same binary under quiet conditions.
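For intuition, targeting a percentile just means reading an order statistic off the sorted samples; a minimal nearest-rank sketch (not folly's actual implementation):

```cpp
#include <algorithm>
#include <vector>

// Nearest-rank percentile over per-iteration timings; pct is in [0, 100].
double percentileNs(std::vector<double> samplesNs, double pct) {
  if (samplesNs.empty()) {
    return 0.0;
  }
  std::sort(samplesNs.begin(), samplesNs.end());
  const double maxIdx = static_cast<double>(samplesNs.size() - 1);
  const auto rank = static_cast<size_t>(pct / 100.0 * maxIdx + 0.5);
  return samplesNs[rank];
}
```

With pct = 0 this degenerates into best-of's minimum; adaptive's default reads near the 33rd percentile.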
A benchmark is stable when first-half and second-half estimates agree.
Split the samples chronologically into halves and compute the percentile estimate and confidence interval for each half. Stable means each half's estimate falls within the other half's confidence interval.
E.g., after 200 samples: first-half p33 = 4.2ns ± 0.1ns, second-half p33 = 4.3ns ± 0.15ns — stable, because each estimate falls within the other's interval.
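A sketch of that mutual-containment rule (names assumed, not folly's actual code):

```cpp
// A percentile estimate with its 95% confidence interval, in nanoseconds.
struct Estimate {
  double valueNs;
  double ciLowNs;
  double ciHighNs;
};

// Stable when each half's point estimate lies inside the other half's CI.
bool isStable(const Estimate& firstHalf, const Estimate& secondHalf) {
  return firstHalf.valueNs >= secondHalf.ciLowNs &&
      firstHalf.valueNs <= secondHalf.ciHighNs &&
      secondHalf.valueNs >= firstHalf.ciLowNs &&
      secondHalf.valueNs <= firstHalf.ciHighNs;
}
```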
The check never drops samples. As the sample count grows, CIs shrink, and the halves must eventually agree (or the timeout reveals that your system is too noisy).
Intuition: If system state is stationary, both halves converge to the same true value. When they disagree, something changed — keep collecting until they agree.
Stability ≠ accuracy. You can have a stable, precise measurement of the "wrong" number (due to binary layout, thermal throttling that persists all run, etc.). But an unstable measurement is definitely unreliable.
We target a percentile, not a mean. While costlier to compute than mean/stdev,
order statistics are robust to outliers and don't assume a distribution.
Aside: we use neither StreamingStats.h (online mean/stdev) nor TDigest.h
(approx quantiles) because we don't have to -- see below.
Our percentile CI estimator uses a normal approximation to the binomial distribution; in practice we always have enough samples for that approximation to be a good one.
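A sketch of such an estimator (a textbook construction, not folly's exact code): the 95% CI for the p-quantile spans the order statistics at ranks n*p ± 1.96 * sqrt(n*p*(1-p)):

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// 95% CI for the p-quantile (p in [0, 1]) via the normal approximation
// to the binomial distribution of order-statistic ranks.
std::pair<double, double> quantileCi95(std::vector<double> samples, double p) {
  if (samples.empty()) {
    return {0.0, 0.0}; // no data, no interval
  }
  std::sort(samples.begin(), samples.end());
  const double n = static_cast<double>(samples.size());
  const double halfWidth = 1.96 * std::sqrt(n * p * (1 - p));
  auto at = [&](double rank) {
    const double clamped = std::min(std::max(rank, 0.0), n - 1);
    return samples[static_cast<size_t>(clamped)];
  };
  return {at(n * p - halfWidth), at(n * p + halfWidth)};
}
```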
Most CPU usage should go to the benchmarks, not the harness. Choices to achieve this:
- **Rare convergence checks:** every 150ms, not every sample. The O(n log n)
  sorting for percentile CIs is amortized, making complex machinery like
  TDigest.h unnecessary for our use case.
- **Downsampled baselines:** the baseline (empty benchmark loop) runs every 8
  rounds (errors matter little since the baseline is cheap, ~0.3ns/iter). The
  suspender baseline (the cost of BenchmarkSuspender usage; see the example
  after this list) runs every 2 rounds (expensive, ~35ns/iter).
- **Measurement slices:** each benchmark runs for ~bm_slice_usec per sample
  (default 1ms), long enough to amortize harness overhead and avoid timing
  noise from function call boundaries.
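For reference, the suspender baseline corrects for code like this, which uses folly's real BENCHMARK_SUSPEND macro to keep setup out of the timed region (the benchmark body itself is illustrative):

```cpp
#include <numeric>
#include <vector>

#include <folly/Benchmark.h>

BENCHMARK(sumPrepared, n) {
  std::vector<int> v;
  BENCHMARK_SUSPEND {
    // Setup runs with the clock paused; the harness still pays a cost
    // (~35ns/iter per the numbers above) to suspend/resume, which the
    // suspender baseline removes.
    v.assign(10000, 1);
  }
  long total = 0;
  for (unsigned i = 0; i < n; ++i) {
    total += std::accumulate(v.begin(), v.end(), 0L);
  }
  folly::doNotOptimizeAway(total);
}
```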
**Self-calibrating iteration counts**
Each benchmark starts at iterCount=1 and self-adjusts after every sample so
that each measurement slice takes ~bm_slice_usec. This avoids a separate
calibration phase, which was susceptible to cold-start effects (page faults,
TLB misses) that could lock iterCount at 1 for the entire run.
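A sketch of that per-sample adjustment (names and the doubling rule are assumptions, not folly's exact code):

```cpp
#include <algorithm>
#include <cstdint>

// After each slice, rescale iterCount so the next slice lands near the
// target duration.
uint64_t recalibrate(
    uint64_t iterCount, double lastSliceUsec, double targetSliceUsec) {
  if (lastSliceUsec <= 0.0) {
    return iterCount * 2; // below timer resolution: grow geometrically
  }
  const double scaled =
      static_cast<double>(iterCount) * targetSliceUsec / lastSliceUsec;
  return std::max<uint64_t>(1, static_cast<uint64_t>(scaled));
}
```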
Note: bm_slice_usec should be high enough to avoid interference from the harness code. The default of 1ms is the reasonable minimum; 100μs already shows serious interference.
**Interleaved sampling**
```
while not converged:
    measure baseline (every 8 rounds)
    measure suspender baseline (every 2 rounds)
    for each benchmark (random order):
        measure calibrated iterations, record sample, recalibrate
    check convergence every 150ms
```
**Are we done?**

For each benchmark, converged means both stable (the two halves agree, as defined above) and precise (the 95% CI is narrower than --bm_target_precision_pct of the estimate).
**Phase 4: Report**
Subtract baselines, pass to existing display logic.
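A sketch of that adjustment (field names assumed; the real display logic lives in folly/Benchmark.cpp):

```cpp
// Per-iteration cost with harness overheads removed: the empty-loop
// baseline, plus the suspender cost scaled by how often it was used.
double adjustedNsPerIter(
    double rawNsPerIter,
    double baselineNsPerIter,
    double suspenderNsPerUse,
    double suspenderUsesPerIter) {
  return rawNsPerIter - baselineNsPerIter -
      suspenderNsPerUse * suspenderUsesPerIter;
}
```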