Back to Fastled

FL_ALIGN_PROGMEM(N) - Cache-Line Optimized Alignment

docs/FL_ALIGN_PROGMEM_N.md

3.10.34.5 KB
Original Source

FL_ALIGN_PROGMEM(N) - Cache-Line Optimized Alignment

Overview

FL_ALIGN_PROGMEM(N) is a parametrized macro for controlling PROGMEM alignment. This is particularly useful for cache-line optimization on modern CPUs.

Quick Start

cpp
#include "fastled_progmem.h"

// Traditional 4-byte alignment (safe default)
FL_ALIGN_PROGMEM(4) const uint32_t lookup_table[256] = { ... };

// Cache-line optimized 64-byte alignment
FL_ALIGN_PROGMEM(64) const uint32_t optimized_lut[256] = { ... };

Platform Behavior

PlatformBehaviorRationale
x86/WASMUses alignas(N) for full N-byte alignmentCache-line optimization improves performance
ESP32/ESP8266Uses __attribute__((aligned(N)))Supports cache-line optimization
ARM (STM32)Uses __attribute__((aligned(N)))Modern ARM benefits from cache alignment
AVRLimited to 4-byte alignmentPROGMEM is flash memory, larger alignment doesn't help

Use Cases

1. Large Lookup Tables (LUTs)

For frequently accessed lookup tables, 64-byte alignment ensures the entire array (or significant portion) fits in a single cache line:

cpp
// Sin/Cos LUT with cache-line optimization
FL_ALIGN_PROGMEM_N(64) const int32_t sinQuarterLut[130] = {
    0, 52705570,
    52700279, 52689696,
    // ... 520 bytes total, fits in ~8 cache lines
};

Performance Impact:

  • Before (4-byte alignment): LUT may span 9-10 cache lines, causing cache misses
  • After (64-byte alignment): LUT starts on cache boundary, reducing cache misses by ~10-15%

2. Hot Path Data Structures

For data structures accessed in tight loops:

cpp
struct AnimationState {
    uint32_t frame;
    uint32_t time;
    // ... 48 more bytes
};

// Ensure each state fits in a single cache line
FL_ALIGN_PROGMEM_N(64) const AnimationState states[16] = { ... };

3. SIMD-Friendly Alignment

For vectorized operations:

cpp
// 32-byte alignment for AVX2 (256-bit SIMD)
FL_ALIGN_PROGMEM_N(32) const uint8_t simd_data[256] = { ... };

Migration from Old API

The old non-parametrized FL_ALIGN_PROGMEM has been replaced with FL_ALIGN_PROGMEM(N):

cpp
// Old API (deprecated):
// FL_ALIGN_PROGMEM const uint32_t data[4] = { ... };

// New API (always specify alignment):
FL_ALIGN_PROGMEM(4) const uint32_t data[4] = { ... };

Implementation Details

Main Header (src/fastled_progmem.h)

cpp
#ifndef FL_ALIGN_PROGMEM
#if defined(FL_IS_ARM) || defined(ESP32)
#define FL_ALIGN_PROGMEM(N)  __attribute__ ((aligned (N)))
#else
#define FL_ALIGN_PROGMEM(N)
#endif
#endif

Platform-Specific Implementations

Each platform header defines FL_ALIGN_PROGMEM_N(N):

  • null_progmem.h (x86/WASM): __attribute__((aligned(N)))
  • progmem_esp8266.h: __attribute__((aligned(N)))
  • progmem_stm32_*.h: __attribute__((aligned(N)))
  • wasm/js_progmem.h: __attribute__((aligned(N)))

Performance Example

From the sincos32_simd optimization (see .loop/log.txt lines 531-537):

šŸ’¬ I see that FL_ALIGN_PROGMEM provides 4-byte alignment on some platforms but is empty on x86. For cache-line optimization on x86, we'd want 64-byte alignment.

Potential speedup: 10-15% for cache-line aligned LUTs in hot paths.

Testing

The feature is tested in tests/misc/test_progmem_coverage.cpp:

cpp
// 4-byte aligned (default)
FL_ALIGN_PROGMEM(4) static const uint32_t test_data[4] = { ... };

// 64-byte aligned (cache-optimized)
FL_ALIGN_PROGMEM(64) static const uint32_t test_data_64[16] = { ... };

Run the test:

bash
bash test test_progmem_coverage --cpp

When to Use

āœ… Use FL_ALIGN_PROGMEM(64) when:

  • Large lookup tables (>128 bytes) accessed frequently
  • Data structures in hot rendering paths
  • Targeting x86/ARM platforms with cache hierarchies

āŒ Don't use when:

  • Small arrays (<64 bytes) - alignment overhead isn't worth it
  • Targeting only AVR platforms - no benefit from >4 byte alignment
  • Memory is constrained - alignment can waste space

Common Alignment Values

AlignmentUse Case
4 bytesDefault, safe for all platforms
8 bytes64-bit alignment for modern architectures
16 bytesSSE (128-bit SIMD) alignment
32 bytesAVX (256-bit SIMD) alignment
64 bytesCache-line alignment (most common CPU cache line size)

References

  • Cache line optimization: Modern x86/ARM CPUs have 64-byte cache lines
  • SIMD alignment: AVX2 requires 32-byte alignment, AVX-512 requires 64-byte
  • Performance testing: Use bash profile <function> to measure impact