agents/docs/research/wave_pde_alternatives.md
Mission scope: Resolve #3111.
Survey numerical-methods + real-time-graphics literature for techniques that improve the visual quality (or reduce the CPU cost) of fl::WaveSimulation2D / WaveFx versus the current super-sample-based scalar leapfrog approach.
Working assumption — quantified: The current approach pays an O(k³) super-sample tax (k² cells × k updates per displayed frame). At SUPER_SAMPLE_4X that is 64× the 1× scalar cost, and the marginal visual gain past SUPER_SAMPLE_2X is essentially decoupled-from-physics anti-aliasing of the output. If there's an algorithmic 10×+ win available it is in how we anti-alias, not in how we propagate.
Author's standing: I am working from FastLED's actual target constraints (AVR through ESP32-S3 PSRAM, 8×8–256×256 grids, integer Q15, no FFTW / Eigen / OpenCV, no FPU on most targets) and have weighted recommendations heavily by cycles per cell per displayed frame on a Cortex-M4 baseline. I have explicitly not weighted by "the most beautiful math" — this is an LED-art problem.
| Family | Best candidate | Δ visual @ 1× grid vs SS-4× baseline | Δ cost vs 1× scalar baseline | Embed score (worst → best of 5 platforms) |
|---|---|---|---|---|
| Higher-order spatial stencils | Compact 9-point O(h⁴) (§5-like geometry, §B-style weights) | comparable | 1.0–1.3× | 4 / 5 (AVR limited) |
| Output anti-aliasing filter | 3×3 separable Gaussian or single-pass SMAA on the display buffer | comparable on smooth fields | 1.05–1.2× | 5 / 5 |
| Dispersion-corrected (NS-FDTD) | Stencil-coefficient tuning at design frequency | comparable–better at "design" wave speeds | 1.0–1.1× | 4 / 5 |
| Implicit time integration | ADI w/ Thomas-algorithm tridiagonal solve | comparable | 2.5–4× (but supports CFL > 1) | 2 / 5 |
| Spectral / FFT | Tessendorf-style FFT propagation | better visually | 8–30× per displayed frame on M4 (no AVR) | 0 / 5 (AVR/M0 ruled out) |
| Lattice Boltzmann (LBM) | D2Q5 acoustic LBM | comparable | 1.3–1.8× | 3 / 5 |
| Procedural (non-PDE) | Sum-of-Gerstner-waves / Tessendorf-sin-spectrum | different look (no reflections / no boundary interaction) but very pleasing | 0.3–2× depending on N_waves | 5 / 5 |
| Adaptive mesh refinement (AMR) | Active-region tracking + sparse update | identical to scalar baseline | 0.1–1× (best when grid is mostly idle) | 4 / 5 |
Short-list (3 of 5) for immediate implementation:
WaveProceduralFx opt-in mode that synthesizes ripples as a sum of N Gerstner / damped-sinusoid wavelets instead of integrating a PDE. For AVR / Low-memory targets where neither the scalar PDE nor the compact O(h⁴) version fits well, this is dramatically cheaper and visually compelling on coarse grids. See §F below.The remaining 2 short-list slots (ADI implicit, Tessendorf FFT on ESP32-S3+) are conditional — they win on specific platforms but lose on the matrix average, so they should land only if a specific user demand surfaces.
All candidates are scored on each of the five target platform families. 5 = ships today as-is, 1 = will not fit / will not run at frame rate, 0 = formally impossible.
| Platform | RAM budget for grid | Native int | HW divider | Notes |
|---|---|---|---|---|
| AVR (Uno / ATmega328) | ≤ 2 KB | 8-bit | no | 8-bit int, software 16/32-bit ops, no FPU |
| Cortex-M0 (RP2040, Teensy-LC, LPC8xx) | ≤ 32 KB | 32-bit | no | No HW divide; multiply ≈ 1 cycle |
| Cortex-M4/M7 (Teensy 3.x/4.x) | 256 KB–1 MB | 32-bit | yes | DSP intrinsics; FPU on M4F/M7 |
| ESP32 / ESP32-S3 / -C6 | 320 KB SRAM + ≥ 2 MB PSRAM | 32-bit | yes | PIE SIMD on S3; PSRAM is slow (~80 ns/access) |
| Host (WASM / native test) | unbounded | 32/64-bit | yes | SSE2/AVX2/NEON; reference for visual A/B |
A score of 3 or above on every platform is the bar for a default-on change. A score of 3 or above on M4+ESP32 is the bar for an opt-in change. Anything that scores 0 anywhere is a per-platform #ifdef opt-in only.
Premise: Super-sample's primary benefit is averaging-out the spatial aliasing of the rendered wave field, not improving PDE accuracy. If we anti-alias at render time we sidestep O(k²) of the cost wholesale.
WaveFx is rarely used at hard edges.9N reads + 9N mul-add + N writes per frame at the displayed resolution. About 1.05–1.2× the scalar baseline depending on whether the colormap is applied before or after.N × sizeof(u8) line buffer for the horizontal pass.WaveFx::setUseChangeGrid + blur infrastructure in src/fl/fx/2d/blend.h / fl/gfx/blur.cpp.hpp — adding a post-process Gaussian is wiring, not new code.Wave2d (100×100, gradient palette) to confirm visual quality at σ ≈ 0.7 LED-pixel widths.N × sizeof(u8)) and motion-vector reprojection, which doesn't exist for a static-camera wave field.Premise: #3104 added the 9-point isotropic Laplacian, which gives O(h²) accuracy with isotropic leading error. The literature has 9-point stencils that achieve O(h⁴) at the same memory access pattern — strictly better.
∇²u ≈ 1/(6h²) · [ 4·(N+S+E+W) + (NW+NE+SW+SE) - 20·C ]. Reads the same 9 cells as the existing isotropic stencil but with weights tuned for O(h⁴) accuracy (the isotropic stencil's weights — 1/6 and 2/3 — are tuned for isotropy, not order).O(h⁴) — invisible at LED-grid resolutions, much smaller than current 9-point's O(h²).wave_simulation_real.cpp.hpp:268 (post-#3104).×20) which on 8-bit takes a few cycles; but otherwise identical to current 9-point.[1, 4, -20]/6 mean the Laplacian magnitude is 6× larger pre-normalization than the current 9-point. The existing i64-promoted Q15 multiply in wave_simulation_real.cpp.hpp already handles this (verified for the ~8.6e9 worst case in #3083) — just need to confirm the new constant doesn't push past i32 even with i64 intermediate. Bench: run the saturated-checkerboard test at compact-4 weights and verify no UBSAN shift-of-negative warnings.h and Δt in the stencil with frequency-dependent correction functions tuned for a specific "design" wave speed. Near-zero numerical dispersion at that speed.setSpeed() time.setSpeed() call would change visual character of waves at off-design speeds. Worth investigating only if WaveFx adopts a fixed nominal speed.~1.5–2× the per-cell reads compared to B1, sharply diminishing returns on visual quality.Premise: explicit leapfrog requires Δt² ≤ h²/c² (CFL). Implicit methods are unconditionally stable — one time step can cover many wavelengths.
O(N) Thomas-algorithm tridiagonal sweeps.CFL > 1 stably so each step propagates further.4× the wavefront distance (CFL = 2), net cost is lower. Net break-even depends on the visual frame budget.N-wide temporary buffer per axis pass.WaveFx changes — wavefronts move faster per simulation step, so the existing Speed UI slider semantics shift. Would need a one-line conversion in the slider readout.O(N^1.5) via banded LU) — strictly worse than ADI for our problem topology.e^(±ick t). Take FFT of u(t), multiply each mode by its propagation factor, inverse FFT. No dispersion, no stencil error.N×N grid, two 2D FFTs per step = O(N² log N). On Cortex-M4 the CMSIS-DSP 2048-point FFT takes ~1.2 ms (Q15) — so a 64×64 2D = 64 × 64-point row FFTs + same column FFTs ≈ 8–15 ms per step on M4. Painful but feasible at 30 fps for 64×64.2 × N². For 64×64 Q15 = 16 KB — fits ESP32, painful on M4.setf(x, y, 1.0) to drop a "pebble" produces a different visual signature than the scalar PDE does. Would feel different to LED-art users.WaveSimulation2D. Worth a follow-up if D1 is pursued.Q=5 distribution functions (one per discrete velocity). Per step: stream (move each distribution to its neighbor along its velocity vector), then collide (relax toward equilibrium).5N reads + 5N writes per step + per-cell O(1) collision math. Per-cell cost is ~1.5–1.8× the current scalar update.5 × i16 per cell instead of the current 2 × i16 (grid1 + grid2). For 100×100: 100 KB instead of 40 KB. AVR-disqualifying.Premise: for LED art, we are not bound by the wave equation — we are bound by "ripples look pretty". Procedural ripple synthesis is dramatically cheaper than any PDE.
WaveProceduralFx)u(x, y, t) = Σᵢ Aᵢ · damp(t - tᵢ) · ripple(|p - cᵢ|, t - tᵢ, kᵢ) for N active wavelets, each spawned at a "drop" location.WaveFx::setf(x, y, 1.0) in the existing API), allocate a wavelet i with center cᵢ = (x, y), birth time tᵢ, frequency kᵢ. Render is a sum-of-rings.O(N_wavelets × N_cells) per frame. With N_wavelets ≤ 8 active at once (typical "raindrop" demo) and 100×100 grid: 80,000 evaluations per frame at maybe 20 cycles each on M4 = 1.6M cycles = ~7 ms at 240 MHz. Faster than the current scalar PDE on the same hardware.~64 bytes × N_wavelets of wavelet state. Zero grid memory.N_wavelets ≤ 4) · M0 5 · M4 5 · ESP32 5 · Host 5.WaveFx will notice the lack of boundary reflections. Worth offering as a parallel WaveProceduralFx class rather than a replacement.O(N_modes × N_cells). Worse asymptotically than F1; only viable for very small grids.O(N_active_cells) per step. For a typical LED-art scene with 2–3 active wavefronts on a 100×100 grid, active cells are maybe 10–20% of total. 5–10× speedup.N/64 bits for tile-level activity bitmap.mUseChangeGrid optimization in wave_simulation.h is already exactly this pattern; needs only an "all-zero tile" fast path.i8 (Q7) instead of i16 (Q15), promote to Q15 during the multiply.WaveSimulation2D_Real8 parallel class or a templated Storage parameter. Worth a focused sub-issue.Covered in §A1 (Gaussian) and discussed below for MLAA/SMAA:
The following were investigated and ruled out. Documenting these explicitly saves the next research run from re-walking them.
| Method | Why ruled out |
|---|---|
| TAA (§A2) | No motion vectors in a static-camera wave field |
| MIP-style decimate (§A3) | Folded into A1 |
| 13-point/17-point stencils (§B3) | Diminishing returns past O(h⁴) at LED resolution |
| Crank-Nicolson fully-implicit (§C2) | Banded LU dominated by ADI |
| LBM D2Q5 (§E1) | Memory cost rules out AVR/M0 |
| Tessendorf-direct sum-of-sines (§F2) | Dominated by sum-of-Gerstner in F1 |
| Perlin-noise wave field (§F3) | Different aesthetic; orthogonal |
| Posit numbers (§H2) | No embedded impl |
| SMAA/FXAA (§J1) | Wrong tool for 16×16-grid scale |
If the project agrees with this report, file the following as focused sub-issues (the way #3098 spawned #3092–#3097):
perf(wave): add compact O(h⁴) Laplacian as a third stencil option — §B1 above. Drop-in addition to the LaplacianStencil enum from #3104. Pairs with disabling super-sample as the default for medium-size grids. Highest-priority — smallest diff, biggest visual / cycle gain.feat(wave): add output Gaussian blur as an alternative to super-sampling — §A1 above. Two new UI controls: OutputBlur (off / 1 / 2 passes) and SuperSample (default NONE once this lands). Big educational win for users who currently default to SS-4× without knowing why.feat(fx): add WaveProceduralFx — sum-of-Gerstner-wavelets ripple synthesizer — §F1 above. New Fx2d class, no PDE. Largest design footprint but highest end-user value, especially on AVR.perf(wave): tile-level active-region tracking in WaveSimulation2D update — §G1 above. Extends the existing useChangeGrid infrastructure. Win is biggest on ESP32-S3 with PSRAM-backed large grids.perf(wave): Q7 grid storage option for memory-constrained targets — §H1 above. Templated Storage = i16 | i8 parameter on the inner classes.SIMD work (#3106 / #3107 / #3108) should remain on hold — items 1, 2, and 4 above all reduce the kernel's hot-path cycle cost by more than SIMD typically delivers, and the algorithmic wins compose with later SIMD work. Pick the algorithm before tuning the constant factor.
Cited inline above; reproduced here for cross-reference:
Compact / high-order finite difference:
Isotropic Laplacian operators:
ADI / implicit:
Lattice Boltzmann:
Tessendorf / FFT:
Gerstner / procedural:
TAA / post-process AA:
Embedded FFT:
Closes #3111.