# docs/profiling/HOW_TO_PROFILE.md

This guide documents the profiling workflow used to optimize Chasing Spirals Q31 (7.9% speedup achieved). Follow these steps to profile your own code changes.
```bash
# 1. Create a profiler in tests/fx/profile_myfeature.cpp
# 2. Compile with profile mode
uv run test.py profile_myfeature --cpp --build-mode profile --build

# 3. Run 20 iterations for statistics
for i in {1..20}; do .build/meson-profile/tests/profile_myfeature.exe; done

# 4. Run accuracy tests
uv run test.py myfeature --cpp

# 5. Analyze results and iterate
```
Create `tests/fx/profile_myfeature.cpp`:

```cpp
#include "FastLED.h"
#include <chrono>
#include <cstdio>
#include <cstring>  // for strcmp

using namespace fl;

static const uint16_t W = 32;
static const uint16_t H = 32;
static const uint16_t N = W * H;
static const int WARMUP_FRAMES = 20;
static const int PROFILE_FRAMES = 200;

// Noinline ensures the profiler can target this function
__attribute__((noinline))
void renderVariant1(MyFx &fx, CRGB *leds, int frames, int start_frame) {
    for (int i = 0; i < frames; i++) {
        uint32_t t = static_cast<uint32_t>((start_frame + i) * 16);
        Fx::DrawContext ctx(t, leds);
        fx.draw(ctx);
    }
}

__attribute__((noinline))
void renderVariant2(MyFx &fx, CRGB *leds, int frames, int start_frame) {
    // Your optimized variant here
}

int main(int argc, char *argv[]) {
    bool do_variant1 = true;
    bool do_variant2 = true;

    // CLI argument parsing
    if (argc > 1) {
        do_variant1 = false;
        do_variant2 = false;
        if (::strcmp(argv[1], "variant1") == 0) {
            do_variant1 = true;
        } else if (::strcmp(argv[1], "variant2") == 0) {
            do_variant2 = true;
        } else {
            do_variant1 = true;
            do_variant2 = true;
        }
    }

    CRGB leds[N] = {};

    // ========================
    // Variant 1 (Baseline)
    // ========================
    if (do_variant1) {
        XYMap xy = XYMap::constructRectangularGrid(W, H);
        MyFx fx(xy, /* config */);

        // Warmup (not profiled)
        renderVariant1(fx, leds, WARMUP_FRAMES, 0);

        // Profile
        auto t0 = std::chrono::high_resolution_clock::now();
        renderVariant1(fx, leds, PROFILE_FRAMES, WARMUP_FRAMES);
        auto t1 = std::chrono::high_resolution_clock::now();
        auto elapsed_us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();

        printf("Variant1: %d frames in %lld us (%.1f us/frame)\n",
               PROFILE_FRAMES, (long long)elapsed_us,
               (double)elapsed_us / PROFILE_FRAMES);
    }

    // ========================
    // Variant 2 (Optimized)
    // ========================
    if (do_variant2) {
        // Similar structure to Variant 1
        // ...
    }

    return 0;
}
```
FastLED supports four build modes optimized for different use cases:
| Mode | Build command | Flags |
|---|---|---|
| profile | `uv run test.py profile_myfeature --cpp --build-mode profile --build` | `-Os -g` |
| quick | `uv run test.py profile_myfeature --cpp --build-mode quick --build` | `-O1 -g1` |
| debug | `uv run test.py profile_myfeature --cpp --build-mode debug --build` | `-O0 -g3 -fsanitize=address,undefined` |
| release | `uv run test.py profile_myfeature --cpp --build-mode release --build` | `-O2 -DNDEBUG` |

Run the resulting binary directly:

```bash
.build/meson-profile/tests/profile_myfeature.exe
```
```bash
# Run 20 iterations and extract timing
for i in {1..20}; do
  .build/meson-profile/tests/profile_myfeature.exe | grep "Variant2"
done
```

Output example:

```
Variant2: 200 frames in 14300 us (71.5 us/frame)
Variant2: 200 frames in 14480 us (72.4 us/frame)
Variant2: 200 frames in 14100 us (70.5 us/frame)
...
```

```bash
# Profile only variant2 (faster)
.build/meson-profile/tests/profile_myfeature.exe variant2
```
Create `.loop/analyze_perf.py`:

```python
import sys
import statistics

values = []
for line in sys.stdin:
    if "us/frame" in line:
        # Extract float from "(71.5 us/frame)"
        val = float(line.split("(")[1].split(" us/frame")[0])
        values.append(val)

if values:
    print(f"Best:   {min(values):.1f} μs")
    print(f"Median: {statistics.median(values):.1f} μs")
    print(f"Worst:  {max(values):.1f} μs")
    if len(values) > 1:  # stdev needs at least two samples
        print(f"StdDev: {statistics.stdev(values):.1f} μs")
```
Usage:

```bash
for i in {1..20}; do .build/meson-profile/tests/profile_myfeature.exe variant2; done | python .loop/analyze_perf.py
```
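Headline improvements such as the 7.9% figure come from comparing medians between variants. A minimal sketch of that comparison (the `improvement` helper and the sample timings are illustrative, not from the repository):

```python
import statistics

def improvement(baseline_us, optimized_us):
    """Percent speedup of the optimized variant, comparing medians."""
    base = statistics.median(baseline_us)
    opt = statistics.median(optimized_us)
    return (base - opt) / base * 100.0

# Illustrative per-frame timings (us) collected from repeated runs
baseline = [77.6, 77.2, 78.0, 77.5, 77.9]
optimized = [71.5, 71.8, 71.2, 71.6, 71.4]

print(f"Improvement: {improvement(baseline, optimized):.1f}%")  # → Improvement: 7.9%
```

Medians are preferred over means here because a single slow outlier run does not shift them.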
Example output from Chasing Spirals optimization:
| Implementation | Best | Median | Improvement |
|---|---|---|---|
| Q31 Original | 74.0 μs | 77.6 μs | Baseline |
| Q16 Optimized | 69.9 μs | 71.5 μs | +7.9% |
Key metrics: best (minimum), median, worst, and standard deviation, as reported by the analysis script.

Run at least 20 iterations to account for run-to-run measurement noise. To reduce that noise:

```bash
# Close unnecessary programs
# Let CPU cool down between runs
sleep 10

# Run with elevated process priority (Windows)
start /high .build/meson-profile/tests/profile_myfeature.exe

# Linux: pin to one CPU and raise priority (negative nice requires root)
taskset -c 0 nice -n -20 ./profile_myfeature
```
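The standard deviation also lets you sanity-check whether a measured difference is real or just noise. A rough heuristic sketch (the `significant` helper and the 2-sigma threshold are illustrative choices, not part of the FastLED tooling):

```python
import statistics

def significant(baseline_us, optimized_us, k=2.0):
    """Treat the median difference as real only if it exceeds
    k standard deviations of the baseline's run-to-run noise."""
    diff = statistics.median(baseline_us) - statistics.median(optimized_us)
    noise = statistics.stdev(baseline_us)
    return diff > k * noise

baseline = [77.6, 77.2, 78.0, 77.5, 77.9]   # illustrative timings (us)
optimized = [71.5, 71.8, 71.2, 71.6, 71.4]
print(significant(baseline, optimized))      # → True
```

A difference smaller than the run-to-run spread should be treated as "no improvement", like the prefetching result documented later in this guide.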
CRITICAL: Every optimization MUST pass accuracy tests before being accepted.
Add to `tests/fx/myfeature.cpp`:

```cpp
#include "test.h"
#include <cstdlib>  // for abs

namespace {

const uint16_t W = 32;
const uint16_t H = 32;
const uint16_t N = W * H;

void renderBaseline(CRGB *leds, uint32_t timeMs) {
    XYMap xy = XYMap::constructRectangularGrid(W, H);
    MyFxBaseline fx(xy);
    Fx::DrawContext ctx(timeMs, leds);
    fx.draw(ctx);
}

void renderOptimized(CRGB *leds, uint32_t timeMs) {
    XYMap xy = XYMap::constructRectangularGrid(W, H);
    MyFxOptimized fx(xy);
    Fx::DrawContext ctx(timeMs, leds);
    fx.draw(ctx);
}

// Count mismatched pixels
int countMismatches(const CRGB *a, const CRGB *b, uint16_t count) {
    int mismatches = 0;
    for (uint16_t i = 0; i < count; i++) {
        if (a[i] != b[i]) {
            mismatches++;
        }
    }
    return mismatches;
}

// Compute per-channel average absolute error
float computeAvgError(const CRGB *a, const CRGB *b, uint16_t count) {
    int64_t total_error = 0;
    for (uint16_t i = 0; i < count; i++) {
        total_error += abs(int(a[i].r) - int(b[i].r));
        total_error += abs(int(a[i].g) - int(b[i].g));
        total_error += abs(int(a[i].b) - int(b[i].b));
    }
    return static_cast<float>(total_error) / (count * 3);
}

// Compute maximum per-channel error
int computeMaxError(const CRGB *a, const CRGB *b, uint16_t count) {
    int max_err = 0;
    for (uint16_t i = 0; i < count; i++) {
        int er = abs(int(a[i].r) - int(b[i].r));
        int eg = abs(int(a[i].g) - int(b[i].g));
        int eb = abs(int(a[i].b) - int(b[i].b));
        int m = er > eg ? er : eg;
        m = m > eb ? m : eb;
        if (m > max_err) max_err = m;
    }
    return max_err;
}

} // namespace

FL_TEST_CASE("MyFeature - low error at t=1000") {
    CRGB leds_baseline[N] = {};
    CRGB leds_optimized[N] = {};

    renderBaseline(leds_baseline, 1000);
    renderOptimized(leds_optimized, 1000);

    int mismatches = countMismatches(leds_baseline, leds_optimized, N);
    float avg_err = computeAvgError(leds_baseline, leds_optimized, N);
    int max_err = computeMaxError(leds_baseline, leds_optimized, N);

    FL_MESSAGE("t=1000: mismatches=", mismatches, "/", N,
               " avg_err=", avg_err, " max_err=", max_err);

    float error_pct = avg_err / 255.0f * 100.0f;
    FL_MESSAGE("Average error at t=1000: ", error_pct, "%");

    // Thresholds (adjust based on your tolerance)
    FL_CHECK_MESSAGE(error_pct < 1.0f,
                     "Average error should be < 1% at t=1000");
    FL_CHECK_MESSAGE(max_err <= 6,
                     "Max per-channel error should be <= 6 at t=1000");
}

FL_TEST_CASE("MyFeature - approximate at high time") {
    const uint32_t times[] = {1000000, 100000000, 2000000000};
    for (uint32_t high_time : times) {
        CRGB leds_baseline[N] = {};
        CRGB leds_optimized[N] = {};

        renderBaseline(leds_baseline, high_time);
        renderOptimized(leds_optimized, high_time);

        float avg_err = computeAvgError(leds_baseline, leds_optimized, N);
        int max_err = computeMaxError(leds_baseline, leds_optimized, N);
        float error_pct = avg_err / 255.0f * 100.0f;

        FL_MESSAGE("t=", high_time, ": avg_err=", avg_err,
                   " max_err=", max_err, " error_pct=", error_pct, "%");

        FL_CHECK_MESSAGE(error_pct < 3.0f,
                         "Average error should be < 3% at high time values");
    }
}
```
Run the accuracy tests:

```bash
uv run test.py myfeature --cpp
```
For integer approximations of float code:
| Metric | Low Time (t=1000) | High Time (t>1M) |
|---|---|---|
| Avg error | < 1% | < 3% |
| Max error | ≤ 6 per channel | ≤ 10 per channel |
| Mismatches | Informational only | Informational only |
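To see how the thresholds translate into concrete numbers, here is a small Python sketch of the same error metrics as the C++ helpers above, using made-up pixel values:

```python
def error_metrics(a, b):
    """Per-channel average absolute error and max per-channel error
    between two equal-length lists of (r, g, b) tuples."""
    total = 0
    max_err = 0
    for (r1, g1, b1), (r2, g2, b2) in zip(a, b):
        errs = (abs(r1 - r2), abs(g1 - g2), abs(b1 - b2))
        total += sum(errs)
        max_err = max(max_err, *errs)
    return total / (len(a) * 3), max_err

baseline = [(255, 0, 0), (0, 128, 0)]    # illustrative pixels
optimized = [(253, 0, 1), (0, 130, 0)]
avg, mx = error_metrics(baseline, optimized)
print(f"avg_err={avg:.2f} ({avg / 255 * 100:.2f}%), max_err={mx}")
# → avg_err=0.83 (0.33%), max_err=2
```

These illustrative values would pass both low-time thresholds (average error below 1%, max error at most 6 per channel).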
For exact implementations, require zero mismatches: the output must be bit-identical to the baseline.
Docker provides a consistent Linux environment with sanitizers (ASAN/LSAN) for catching memory bugs.
```bash
# Run profiler in Docker (implies --debug with sanitizers)
bash test --docker tests/fx/profile_myfeature

# Fast mode (no sanitizers)
bash test --docker tests/fx/profile_myfeature --quick

# Release mode (optimized)
bash test --docker tests/fx/profile_myfeature --build-mode release
```
Compile your profiler for the Docker environment:

```bash
# Compile in Docker
bash compile --docker tests/fx/profile_myfeature.cpp

# Then run manually
docker run -it --rm -v $(pwd):/workspace fastled-test \
  /workspace/.build/meson-docker-debug/tests/profile_myfeature
```
Use Docker when you need the sanitizers or a reproducible Linux environment. Avoid Docker for timing-sensitive measurements, where container overhead adds noise; the `--quick` flag skips sanitizers for faster iteration.

One change at a time:
**❌ Bad (combined changes):**
- Reduced precision from Q24 to Q16
- Added batching
- Changed color order

Which change caused the speedup?

**✅ Good (incremental changes):**
- Baseline: Q24
- Test 1: Q24 with batching
- Test 2: Q16 without batching
- Test 3: Q16 with batching
Measure first, optimize second:

```bash
# 1. Profile to find the bottleneck
valgrind --tool=callgrind ./profile_myfeature baseline

# 2. Analyze callgrind output
kcachegrind callgrind.out.12345

# 3. Optimize the hot path (top 20% of time)
# 4. Verify the speedup
```
Example from `PERLIN_OPTIMIZATION_RESULTS.md`:

```markdown
## Phase 1: Permutation Table Prefetching
**Goal**: Reduce pipeline stalls with __builtin_prefetch()
**Result**: 0% improvement (74.3 μs → 74.1 μs, measurement noise)
**Conclusion**: Hardware prefetching already optimal. Manual hints provide no benefit.
**Code reverted**: Yes
```
Why document failures? It saves the next person from re-running the same dead-end experiments.
Use the same conditions for all tests: same machine, same build mode, same frame counts, and the same number of iterations.
Modern compilers (Clang 21.1.5) are excellent:
```cpp
// ❌ Manual optimization that the compiler beats
#define MANUAL_INLINE __attribute__((always_inline))
MANUAL_INLINE int lerp(...) { ... }

// ✅ Let the compiler decide
FASTLED_FORCE_INLINE int lerp(...) { ... }
```
In the Chasing Spirals optimization, several hot spots turned out to be already optimized by the compiler, so manual tuning there (such as the prefetch hints documented in `PERLIN_OPTIMIZATION_RESULTS.md`) produced no measurable gain.
```bash
# Create a feature branch for optimization work
git checkout -b optimize-chasing-spirals

# Commit each variant separately
git commit -m "Add Q16 Perlin variant (16 fractional bits)"
git commit -m "Add Batch4 color-grouped variant"

# Easy to revert if needed
git revert HEAD~1
```
Complete workflow from the real optimization that achieved the 7.9% speedup.

Create `tests/fx/profile_chasing_spirals.cpp`, then build it in profile mode:

```bash
uv run test.py profile_chasing_spirals --cpp --build-mode profile --build
```
```bash
# Run 20 iterations for each variant
for i in {1..20}; do
  .build/meson-profile/tests/profile_chasing_spirals.exe q31 | grep "Q31 (original)"
done

for i in {1..20}; do
  .build/meson-profile/tests/profile_chasing_spirals.exe q16 | grep "Q16"
done
```
| Implementation | Best | Median | vs Original |
|---|---|---|---|
| Q31 Original | 74.0 μs | 77.6 μs | Baseline |
| Q31 Batch4 | 70.5 μs | 74.5 μs | +4.0% |
| Q31 Batch4 Color | 70.8 μs | 74.0 μs | +4.6% |
| Q16 Batch4 Color | 69.9 μs | 71.5 μs | +7.9% ✅ |
Run the accuracy tests:

```bash
uv run test.py animartrix2 --cpp
```

Q16 results: the variant passed the accuracy thresholds and was accepted.

Created comprehensive documentation:
- `PERLIN_OPTIMIZATION_RESULTS.md` - Failed attempts (Phases 1-3)
- `CHASING_SPIRALS_PERFORMANCE_AUDIT.md` - Performance breakdown
- `OPTIMIZATION_SUMMARY.md` - Executive summary

Updated `src/fl/fx/2d/animartrix2.hpp` line 117:

```cpp
{ANIM2_CHASING_SPIRALS, "CHASING_SPIRALS",
 &q16::Chasing_Spirals_Q16_Batch4_ColorGrouped},
```
Windows access violation (0xC0000005). Common causes:

**Null pointer dereference**

```cpp
// Check your pointers!
if (e->mCtx->leds == nullptr) { /* handle */ }
```

**Buffer overflow**

```cpp
// Check array bounds
FL_CHECK(i < total_pixels, "Index out of bounds");
```

**Type overflow**

```cpp
// ❌ i16 HP_ONE = 1 << 16; // Overflow!
// ✅ i32 HP_ONE = 1 << 16; // Correct
```
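The type-overflow pitfall comes from the fixed-point ONE constant. A minimal Python sketch of Q16 arithmetic (illustrative, not FastLED's implementation) shows why the constant needs a 32-bit type and how a Q16 multiply rescales its result:

```python
Q16_SHIFT = 16
Q16_ONE = 1 << Q16_SHIFT          # 65536: does not fit in a signed 16-bit int

def to_q16(x: float) -> int:
    """Encode a float as a Q16 fixed-point integer (value * 2^16)."""
    return int(round(x * Q16_ONE))

def q16_mul(a: int, b: int) -> int:
    """Multiply two Q16 values; the raw product has 32 fractional
    bits, so shift back down to 16."""
    return (a * b) >> Q16_SHIFT

a = to_q16(0.5)
b = to_q16(0.25)
print(q16_mul(a, b) / Q16_ONE)    # → 0.125
```

In C++ the intermediate product needs a 64-bit type for the same reason the ONE constant needs 32 bits: each step doubles the bits in play.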
Debug with:

```bash
# Run in debug mode with sanitizers
uv run test.py profile_myfeature --cpp --debug
```
**Check compiler optimization flags**

```bash
# Use profile mode, not debug
--build-mode profile
```

**Verify warmup frames are sufficient**

```cpp
static const int WARMUP_FRAMES = 20; // Prime caches
```

**Run more iterations** (reduces measurement noise)

```bash
for i in {1..50}; do ./profiler; done
```

**Check CPU throttling**

```bash
# Let CPU cool down
sleep 60
```
```bash
# Run with leak detection
uv run test.py myfeature --cpp --debug

# Output shows:
# LEAK SUMMARY: 64 bytes in 1 allocation
```

Fix:

```cpp
// Use RAII (fl::vector, fl::unique_ptr) instead of raw pointers
fl::vector<int> data(100); // Auto-cleanup
```
For deep performance analysis:

```bash
# Compile with profile mode
uv run test.py profile_myfeature --cpp --build-mode profile --build

# Run under callgrind
valgrind --tool=callgrind \
  --callgrind-out-file=callgrind.out \
  .build/meson-profile/tests/profile_myfeature baseline

# Analyze with kcachegrind (GUI)
kcachegrind callgrind.out

# Or text analysis
callgrind_annotate callgrind.out
```
Look for the functions that dominate total time; the hot path is typically the top 20%.

Before declaring an optimization complete, confirm that the accuracy tests pass and that all timings were gathered with `--build-mode profile`.

Remember: measure twice, optimize once. Profile-guided optimization beats guesswork every time.