Benchmarks: Flex vs NdArray

crates/burn-flex/BENCHMARKS.md

All benchmarks run on Apple M3 Max, comparing burn-flex against burn-ndarray. Default features enabled (std, simd, rayon); gemm is a required dependency.

Date: 2026-04-06

How to Read

- Median time reported (lower is better)
- Speedup = NdArray median / Flex median
- Mem = peak allocation (max alloc from divan)
- Speedup above 1x means Flex wins; ~1x is a tie; below 1x means NdArray wins
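
As a concrete check of the formula above, dividing the two medians from the add/1M row of the first table reproduces the reported 1.4x:

```rust
// Speedup = NdArray median / Flex median (values from the add/1M row below).
fn speedup(ndarray_median_us: f64, flex_median_us: f64) -> f64 {
    ndarray_median_us / flex_median_us
}

fn main() {
    let s = speedup(115.0, 83.9);
    // Rounded to one decimal place this is the 1.4x shown in the table.
    assert_eq!(format!("{:.1}x", s), "1.4x");
}
```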

Binary Operations (f32)

| Operation | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| add | 4K | 389 ns | 436 ns | 1.1x | 16.4 KB | 16.4 KB |
| add | 64K | 7.36 us | 7.45 us | ~1x | 262 KB | 262 KB |
| add | 1M | 83.9 us | 115 us | 1.4x | 4.19 MB | 4.19 MB |
| mul | 4K | 382 ns | 430 ns | 1.1x | 16.4 KB | 16.4 KB |
| mul | 64K | 7.40 us | 7.40 us | ~1x | 262 KB | 262 KB |
| mul | 1M | 115 us | 115 us | ~1x | 4.19 MB | 4.19 MB |
| div | 1M | 115 us | 115 us | ~1x | 4.19 MB | 4.19 MB |
| add_scalar | 1M | 78.7 us | 87.8 us | 1.1x | 4.19 MB | 4.19 MB |
| mul_scalar | 1M | 75.8 us | 87.5 us | 1.2x | 4.19 MB | 4.19 MB |
| powf | 64K | 197 us | 199 us | ~1x | 262 KB | 262 KB |
| powf | 1M | 3.17 ms | 3.21 ms | ~1x | 4.19 MB | 4.19 MB |
| powf_scalar | 1M | 3.23 ms | 3.18 ms | ~1x | 4.19 MB | 4.19 MB |
| atan2 | 64K | 143 us | 142 us | ~1x | 262 KB | 262 KB |
| atan2 | 1M | 2.33 ms | 2.32 ms | ~1x | 4.19 MB | 4.19 MB |

Transposed

| Operation | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| add | 256x256 | 48.5 us | 46.0 us | 0.95x | 262 KB | 262 KB |
| add | 1024x1024 | 1.00 ms | 990 us | ~1x | 4.19 MB | 4.19 MB |

Binary Operations (i64)

| Operation | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| int_add | 4K | 361 ns | 655 ns | 1.8x | 16.5 KB | 32.8 KB |
| int_add | 64K | 7.40 us | 14.6 us | 2.0x | 262 KB | 524 KB |
| int_add | 1M | 115 us | 230 us | 2.0x | 4.19 MB | 8.39 MB |
| int_mul | 4K | 366 ns | 1.95 us | 5.3x | 16.4 KB | 32.8 KB |
| int_mul | 64K | 7.40 us | 26.7 us | 3.6x | 262 KB | 524 KB |
| int_mul | 1M | 115 us | 230 us | 2.0x | 4.19 MB | 8.39 MB |
| int_div | 1M | 604 us | 698 us | 1.2x | 4.19 MB | 8.39 MB |
| int_add_scalar | 1M | 75.8 us | 174 us | 2.3x | 4.19 MB | 8.39 MB |
| int_mul_scalar | 1M | 75.7 us | 258 us | 3.4x | 4.19 MB | 8.39 MB |

Int Power

| Operation | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| int_powi | 256x256 | 95.6 us | 83.1 us | 0.87x | 262 KB | 524 KB |
| int_powi | 1024x256 | 336 us | 382 us | 1.1x | 1.05 MB | 2.10 MB |

Transposed (i64)

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| int_add | 256x256 | 55.6 us | 50.5 us | 0.91x |
| int_add | 1024x1024 | 996 us | 1.10 ms | 1.1x |

Int Cast

| Operation | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| i64 to i8 | 256x256 | 3.19 us | 20.2 us | 6.3x | 65.6 KB | 65.6 KB |
| i64 to i32 | 64x64 | 16.8 ns | 1.37 us | 82x | 16.0 B | 16.4 KB |
| i64 to i32 | 256x256 | 13.6 ns | 20.0 us | ~1475x | 16.0 B | 262 KB |
| i64 to i32 | 1024x1024 | 13.6 ns | 352 us | ~25963x | 16.0 B | 4.19 MB |

Int Random

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| uniform | 64x64 | 20.9 us | 31.6 us | 1.5x |
| uniform | 256x256 | 334 us | 510 us | 1.5x |
| uniform | 1024x1024 | 5.36 ms | 8.16 ms | 1.5x |
| uniform | 16x128x128 | 1.35 ms | 2.04 ms | 1.5x |

Matrix Multiplication

Square (f32)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 64x64 | 6.06 us | 18.9 us | 3.1x | 33.6 KB | 49.3 KB |
| 128x128 | 43.8 us | 41.8 us | ~1x | 328 KB | 197 KB |
| 256x256 | 166 us | 138 us | 0.83x | 524 KB | 786 KB |
| 512x512 | 579 us | 840 us | 1.4x | 2.10 MB | 3.15 MB |
| 1024x1024 | 2.69 ms | 5.83 ms | 2.2x | 8.39 MB | 12.6 MB |

Rectangular (f32)

| Shape | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 512x64 x 64x512 | 167 us | 144 us | 0.87x |
| 256x512 x 512x256 | 265 us | 266 us | ~1x |
| 128x1024 x 1024x128 | 190 us | 199 us | ~1x |

Transposed (256x256)

| Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| LHS transposed | 140 us | 174 us | 1.2x |
| RHS transposed | 158 us | 173 us | 1.1x |
| Both transposed | 165 us | 210 us | 1.3x |

Batched (f32)

| Shape | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 8x 64x64 | 57.6 us | 76.7 us | 1.3x |
| 32x 64x64 | 67.0 us | 111 us | 1.7x |
| 16x 128x128 | 267 us | 540 us | 2.0x |
| 12x 512x64 (heads) | 777 us | 1.71 ms | 2.2x |

Broadcast (f32)

| Shape | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| [1,64,64] x [8,64,64] | 47.9 us | 79.8 us | 1.7x |
| [8,64,64] x [1,64,64] | 54.0 us | 78.5 us | 1.5x |
| [2,1,32,32] x [1,4,32,32] | 7.00 us | 40.0 us | 5.7x |
| [4,1,64,64] x [1,4,64,64] | 50.0 us | 66.4 us | 1.3x |

Integer (i32)

| Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 64x64 | 30.2 us | 110 us | 3.7x |
| 128x128 | 196 us | 971 us | 4.9x |
| 256x256 | 1.90 ms | 10.1 ms | 5.3x |
| 512x512 | 18.3 ms | 119 ms | 6.5x |

Slice Operations

Basic Slicing

| Operation | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| slice 1D | 1K | 112 ns | 235 ns | 2.1x | 56.0 B | 2.15 KB |
| slice 1D | 1M | 105 ns | 26.4 us | ~252x | 8.75 B | 2.10 MB |
| slice 2D | 256x256 | 120 ns | 3.61 us | 30x | 19.0 B | 65.7 KB |
| slice 2D | 1024x1024 | 115 ns | 31.9 us | ~278x | 17.5 B | 1.05 MB |
| slice 3D | 64x64x64 | 148 ns | 16.0 us | ~108x | 28.5 B | 131 KB |

Narrow

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| narrow dim0 | 256x256 | 141 ns | 1.70 us | 12x |
| narrow dim0 | 1024x1024 | 142 ns | 26.0 us | ~183x |
| narrow dim1 | 256x256 | 128 ns | 6.84 us | 53x |

Slice Assignment

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| assign 1D | 1K | 304 ns | 395 ns | 1.3x |
| assign 2D | 256x256 | 5.39 us | 5.69 us | 1.1x |
| assign 2D | 1024x1024 | 74.9 us | 74.8 us | ~1x |

Transposed Slicing

| Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 256x256 | 98.2 ns | 7.98 us | 81x |
| 1024x1024 | 98.2 ns | 232 us | ~2363x |

Slice with Step

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| step2 1D | 1K | 87.1 ns | 360 ns | 4.1x |
| step2 1D | 1M | 76.7 ns | 142 us | ~1849x |
| step2 2D | 1024x1024 | 103 ns | 86.8 us | ~845x |
| step4 2D | 256x256 | 101 ns | 2.65 us | 26x |

Concatenation

Cat (dim 0, contiguous memcpy fast path)

| Tensors | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| 4x | 256x256 | 16.0 us | 33.1 us | 2.1x | 1.05 MB | 2.10 MB |
| 4x | 1024x256 | 57.9 us | 132 us | 2.3x | 4.20 MB | 8.39 MB |
| 16x | 64x64 | 4.11 us | 11.5 us | 2.8x | 265 KB | 528 KB |
| 4x | 16K (1D) | 3.47 us | 10.1 us | 2.9x | 263 KB | 525 KB |

Cat (dim 1, general path)

| Tensors | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| 4x | 256x64 | 6.92 us | 59.2 us | 8.6x | 263 KB | 525 KB |
| 4x | 1024x64 | 25.5 us | 366 us | 14x | 1.05 MB | 2.10 MB |

Flex's dim-1 cat wins by a wide margin because NdArray's default implementation issues one slice_assign call per input tensor, while Flex copies each contiguous chunk directly.
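
The chunked-copy idea can be sketched as follows (illustrative only, not the actual burn-flex code): for row-major 2D tensors, each output row is the concatenation of one contiguous row slice per input, so each (row, input) pair is a single memcpy rather than a full slice_assign.

```rust
// Hypothetical sketch of dim-1 concatenation via contiguous chunk copies.
// `inputs` is a list of (row-major data, column count) pairs sharing `rows`.
fn cat_dim1(inputs: &[(&[f32], usize)], rows: usize) -> Vec<f32> {
    let out_cols: usize = inputs.iter().map(|&(_, c)| c).sum();
    let mut out = vec![0.0f32; rows * out_cols];
    for r in 0..rows {
        let mut offset = r * out_cols;
        for &(data, cols) in inputs {
            // One contiguous copy per (row, input) pair.
            out[offset..offset + cols].copy_from_slice(&data[r * cols..(r + 1) * cols]);
            offset += cols;
        }
    }
    out
}

fn main() {
    // Two 2x2 matrices concatenated along dim 1 -> one 2x4 matrix.
    let a: [f32; 4] = [1.0, 2.0, 3.0, 4.0];
    let b: [f32; 4] = [5.0, 6.0, 7.0, 8.0];
    let out = cat_dim1(&[(&a[..], 2), (&b[..], 2)], 2);
    assert_eq!(out, vec![1.0, 2.0, 5.0, 6.0, 3.0, 4.0, 7.0, 8.0]);
}
```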


Reduce Operations

Full Tensor Sum

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 1K | 118 ns | 156 ns | 1.3x | 76.2 B | 44.0 B |
| 64K | 3.23 us | 6.20 us | 1.9x | 80.0 B | 44.0 B |
| 1M | 43.3 us | 97.0 us | 2.2x | 84.0 B | 44.0 B |

Full Tensor Max

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 1K | 207 ns | 662 ns | 3.2x | 76.2 B | 44.0 B |
| 64K | 8.78 us | 32.5 us | 3.7x | 84.0 B | 44.0 B |
| 1M | 139 us | 558 us | 4.0x | 84.0 B | 44.0 B |

Full Tensor Min

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 1K | 278 ns | 579 ns | 2.1x | 84.0 B | 44.0 B |
| 64K | 9.15 us | 34.8 us | 3.8x | 84.0 B | 44.0 B |
| 1M | 142 us | 540 us | 3.8x | 84.0 B | 44.0 B |

Int Max

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 256x256 | 2.84 us | 9.12 us | 3.2x | 84.0 B | 48.0 B |
| 1024x1024 | 42.2 us | 145 us | 3.4x | 92.0 B | 48.0 B |

Sum Along Dimension

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 256x256 | 0 | 5.06 us | 11.4 us | 2.3x |
| 256x256 | 1 | 2.77 us | 4.61 us | 1.7x |
| 1024x1024 | 0 | 80.0 us | 100 us | 1.3x |
| 1024x1024 | 1 | 42.2 us | 82.0 us | 1.9x |

3D Sum (Batched)

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 32x256x256 | 1 | 156 us | 212 us | 1.4x |
| 32x256x256 | 2 | 86.2 us | 134 us | 1.6x |

Sum Transposed

| Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 256x256 | 3.27 us | 6.20 us | 1.9x |
| 1024x1024 | 41.4 us | 96.7 us | 2.3x |

Sum Dim on Transposed

| Size | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 256x256 | 0 | 2.79 us | 4.40 us | 1.6x |
| 1024x1024 | 0 | 41.5 us | 81.9 us | 2.0x |

Mean Along Dimension

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 256x256 | 1 | 2.90 us | 4.53 us | 1.6x |
| 1024x1024 | 1 | 42.5 us | 82.4 us | 1.9x |

Argmax

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1K | - | 752 ns | 4.09 us | 5.4x |
| 256x256 | 1 | 66.7 us | 242 us | 3.6x |
| 1024x1024 | 1 | 120 us | 3.98 ms | 33x |

Cumulative Operations

Cumsum

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1K | 0 | 838 ns | 65.7 us | 78x |
| 64K | 0 | 45.2 us | 4.25 ms | 94x |
| 1M | 0 | 719 us | 68.1 ms | 95x |
| 256x256 | 0 | 11.3 us | 34.2 us | 3.0x |
| 256x256 | 1 | 42.6 us | 215 us | 5.0x |
| 1024x1024 | 1 | 709 us | 5.51 ms | 7.8x |

Cumprod

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1K | 0 | 1.27 us | 66.0 us | 52x |
| 256x256 | 1 | 66.3 us | 216 us | 3.3x |

Cummin

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1K | 0 | 1.79 us | 66.3 us | 37x |
| 256x256 | 1 | 102 us | 204 us | 2.0x |
| 1024x1024 | 1 | 1.71 ms | 5.53 ms | 3.2x |

Cummax

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1K | 0 | 1.82 us | 65.9 us | 36x |
| 256x256 | 1 | 102 us | 123 us | 1.2x |
| 1024x1024 | 1 | 1.70 ms | 3.60 ms | 2.1x |

3D Cumsum (Batched)

| Shape | Dim | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 32x64x64 | 1 | 24.1 us | 83.7 us | 3.5x |
| 32x64x64 | 2 | 68.0 us | 237 us | 3.5x |

Gather/Scatter Operations

Gather

| Shape | Dim | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| 256x256 | 0 | 32.9 us | 140 us | 4.3x | 393 KB | 786 KB |
| 256x256 | 1 | 33.8 us | 87.1 us | 2.6x | 393 KB | 786 KB |
| 1024x1024 | 1 | 273 us | 1.31 ms | 4.8x | 6.29 MB | 12.6 MB |

Scatter Add

| Shape | Dim | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| 256x256 | 1 | 35.7 us | 189 us | 5.3x | 524 KB | 918 KB |
| 1024x1024 | 1 | 563 us | 2.83 ms | 5.0x | 8.39 MB | 14.7 MB |

Select

| Shape | Dim | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| 256x256 | 0 | 2.02 us | 12.9 us | 6.4x | 132 KB | 143 KB |
| 256x256 | 1 | 26.3 us | 31.1 us | 1.2x | 132 KB | 143 KB |
| 1024x1024 | 0 | 26.8 us | 88.1 us | 3.3x | 2.10 MB | 2.15 MB |

Bool Select

| Shape | Indices | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| 256x256 | 128 | 935 ns | 12.0 us | 13x | 33.9 KB | 45.0 KB |
| 1024x256 | 512 | 2.89 us | 47.4 us | 16x | 135 KB | 180 KB |

Select Add

| Shape | Dim | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| 256x256 | 0 | 7.35 us | 13.5 us | 1.8x | 263 KB | 263 KB |
| 1024x1024 | 0 | 103 us | 126 us | 1.2x | 4.20 MB | 4.20 MB |

Unary Operations

Basic Math

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| exp | 4K | 5.07 us | 5.20 us | ~1x |
| exp | 64K | 80.5 us | 85.0 us | 1.1x |
| exp | 1M | 1.31 ms | 1.35 ms | ~1x |
| log | 4K | 6.74 us | 6.87 us | ~1x |
| log | 64K | 106 us | 111 us | ~1x |
| log | 1M | 1.72 ms | 1.77 ms | ~1x |
| sqrt | 4K | 612 ns | 860 ns | 1.4x |
| sqrt | 64K | 9.03 us | 12.8 us | 1.4x |
| sqrt | 1M | 142 us | 195 us | 1.4x |
| abs | 1M | 75.8 us | 75.8 us | ~1x |
| recip | 1M | 75.5 us | 75.6 us | ~1x |

Trigonometric

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| sin | 4K | 5.65 us | 8.04 us | 1.4x |
| sin | 64K | 89.2 us | 130 us | 1.5x |
| sin | 1M | 1.45 ms | 2.10 ms | 1.4x |
| cos | 4K | 6.57 us | 8.45 us | 1.3x |
| cos | 1M | 1.68 ms | 2.21 ms | 1.3x |
| tanh | 4K | 7.07 us | 13.7 us | 1.9x |
| tanh | 64K | 112 us | 222 us | 2.0x |
| tanh | 1M | 1.80 ms | 3.57 ms | 2.0x |

Transposed (Non-contiguous)

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| exp | 256x256 | 80.1 us | 84.8 us | 1.1x |
| exp | 1024x1024 | 1.31 ms | 1.35 ms | ~1x |

Comparison & Boolean Operations

Tensor-Tensor Comparisons

| Operation | Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| greater | 4K | 431 ns | 398 ns | ~1x | 4.17 KB | 4.14 KB |
| greater | 64K | 6.48 us | 5.73 us | ~1x | 65.6 KB | 65.6 KB |
| greater | 1M | 93 us | 88 us | ~1x | 1.05 MB | 1.05 MB |
| equal | 4K | 433 ns | 403 ns | ~1x | 4.17 KB | 4.14 KB |
| equal | 1M | 86 us | 89 us | ~1x | 1.05 MB | 1.05 MB |
| lower | 1M | 92 us | 87 us | ~1x | 1.05 MB | 1.05 MB |

Scalar Comparisons

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| greater_elem | 1M | 56 us | 76 us | 1.36x |

Transposed Comparisons

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| greater | 256x256 | 53.6 us | 43.7 us | 0.82x |
| greater | 1024x1024 | 985 us | 990 us | ~1x |

Broadcast Comparisons

| Operation | Shape | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| greater | 256x256 | 7.98 us | 25.6 us | 3.2x |
| greater | 1024x1024 | 120 us | 317 us | 2.6x |

Expand (Broadcasting)

| Operation | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 1x1 to 1000x1000 | 126 ns | 291 us | ~2307x |
| 1024x1 to 1024x1024 | 110 ns | 310 us | ~2803x |
| 1x1024 to 1024x1024 | 126 ns | 78.6 us | ~623x |

Boolean Operations

| Operation | Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| bool_not | 1M | 24.1 us | 19.0 us | 0.79x |
| bool_and | 1M | 34.6 us | 28.8 us | 0.83x |

Convolutions

Kernel Size Comparison (4x64x56x56, 64 to 128 channels)

| Kernel | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 1x1 | 577 us | 803 us | 1.4x |
| 3x3 | 3.65 ms | 9.46 ms | 2.6x |
| 5x5 | 8.05 ms | 24.7 ms | 3.1x |
| 7x7 | 15.7 ms | 49.8 ms | 3.2x |

ResNet Layers (batch=1, 3x3)

| Layer | Input | Channels | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- | --- |
| conv1 | 1x3x224x224 | 3 to 64 (k7s2) | 954 us | 1.26 ms | 1.3x |
| layer1 | 1x64x56x56 | 64 to 64 | 986 us | 1.82 ms | 1.8x |
| layer2 | 1x128x28x28 | 128 to 128 | 1.08 ms | 1.60 ms | 1.5x |
| layer3 | 1x256x14x14 | 256 to 256 | 1.65 ms | 3.08 ms | 1.9x |
| layer4 | 1x512x7x7 | 512 to 512 | 2.71 ms | 10.3 ms | 3.8x |

Small (batch=1, 3x3)

| Input | Channels | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x3x32x32 | 3 to 16 | 71.3 us | 79.4 us | 1.1x |
| 1x16x32x32 | 16 to 32 | 219 us | 252 us | 1.2x |
| 1x32x16x16 | 32 to 64 | 164 us | 338 us | 2.1x |

Large Batched (batch=16, 3x3)

| Input | Channels | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 16x64x128x128 | 64 to 128 | 79.8 ms | 180 ms | 2.3x |
| 16x128x64x64 | 128 to 256 | 59.8 ms | 218 ms | 3.7x |

Medium Batched (batch=8, 3x3)

| Input | Channels | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 8x3x64x64 | 3 to 64 | 925 us | 491 us | 0.53x |
| 8x32x64x64 | 32 to 64 | 4.67 ms | 6.44 ms | 1.4x |
| 8x64x32x32 | 64 to 128 | 3.07 ms | 9.27 ms | 3.0x |

Conv1d

| Input | Kernel | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x16x256 | 3 | 31.4 us | 163 us | 5.2x |
| 8x32x512 | 5 | 536 us | 2.32 ms | 4.3x |
| 16x64x1024 | 7 | 5.17 ms | 50.7 ms | 9.8x |

Pooling

Max Pool 2D

| Input | Kernel | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x64x56x56 | 3x3 s2 | 135 us | 165 us | 1.2x |
| 8x64x56x56 | 3x3 s2 | 683 us | 914 us | 1.3x |
| 16x128x28x28 | 2x2 s2 | 406 us | 640 us | 1.6x |
| 1x512x14x14 | 2x2 s2 | 90.5 us | 106 us | 1.2x |

Max Pool 2D (ResNet)

| Input | Kernel | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x64x112x112 | 3x3 s2 | 446 us | 520 us | 1.2x |
| 8x64x112x112 | 3x3 s2 | 2.63 ms | 3.00 ms | 1.1x |
| 16x64x112x112 | 3x3 s2 | 5.03 ms | 5.91 ms | 1.2x |

Avg Pool 2D

| Input | Kernel | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x64x56x56 | 3x3 s2 | 155 us | 149 us | ~1x |
| 8x64x56x56 | 3x3 s2 | 782 us | 889 us | 1.1x |
| 16x128x28x28 | 2x2 s2 | 484 us | 480 us | ~1x |

Adaptive Avg Pool 2D

| Input | Output | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x256x56x56 | 7x7 | 151 us | 142 us | 0.94x |
| 1x512x7x7 | 1x1 | 63.7 us | 68.2 us | 1.1x |
| 8x512x7x7 | 1x1 | 112 us | 110 us | ~1x |
| 16x2048x7x7 | 1x1 | 286 us | 289 us | ~1x |

Max Pool 1D

| Input | Kernel | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x64x256 | 3 s2 | 57.8 us | 79.4 us | 1.4x |
| 8x128x512 | 3 s2 | 316 us | 828 us | 2.6x |
| 16x256x1024 | 3 s2 | 1.73 ms | 5.41 ms | 3.1x |

Kernel Size Comparison (4x64x56x56)

| Kernel | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 2x2 | 221 us | 317 us | 1.4x |
| 3x3 | 375 us | 515 us | 1.4x |
| 5x5 | 1.03 ms | 799 us | 0.78x |

Transposed Convolutions

Conv Transpose 2D

| Input | Output | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x64x7x7 | 14x14 | 138 us | 1.67 ms | 12x |
| 1x128x14x14 | 28x28 | 533 us | 12.9 ms | 24x |
| 1x256x28x28 | 56x56 | 2.49 ms | 209 ms | 84x |
| 1x512x7x7 k3s1 | 7x7 | 1.01 ms | 52.6 ms | 52x |
| 8x64x14x14 | 28x28 | 3.41 ms | 49.7 ms | 15x |

DCGAN Generator

| Layer | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 1x1 to 4x4 | 156 us | 1.43 ms | 9.2x | 33.1 KB | 16.4 KB |
| 4x4 to 8x8 | 234 us | 3.83 ms | 16x | 164 KB | 32.8 KB |
| 8x8 to 16x16 | 305 us | 4.24 ms | 14x | 852 KB | 65.6 KB |
| 16x16 to 32x32 | 26.4 us | 1.47 ms | 56x | 193 KB | 12.3 KB |

Conv Transpose 1D

| Input | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 1x64x32 | 17.5 us | 383 us | 22x |
| 8x128x64 | 433 us | 9.02 ms | 21x |
| 1x256x128 | 337 us | 8.95 ms | 27x |

Conv Transpose 3D

| Input | Output | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x32x4x4x4 | 8x8x8 | 245 us | 2.58 ms | 11x |
| 1x64x8x8x8 | 16x16x16 | 1.41 ms | 47.4 ms | 34x |

Interpolation

Nearest

| Input | Output | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x3x64x64 | 128x128 | 22.7 us | 138 us | 6.1x |
| 1x3x32x32 | 128x128 | 22.8 us | 143 us | 6.3x |
| 1x3x256x256 | 128x128 | 22.5 us | 143 us | 6.3x |
| 8x3x64x64 | 128x128 | 58.5 us | 307 us | 5.2x |
| 1x64x32x32 | 64x64 | 57.9 us | 254 us | 4.4x |

Bilinear

| Input | Output | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x3x64x64 | 128x128 | 81.2 us | 161 us | 2.0x |
| 1x3x32x32 | 128x128 | 86.5 us | 147 us | 1.7x |
| 1x3x256x256 | 128x128 | 83.9 us | 157 us | 1.9x |
| 8x3x64x64 | 128x128 | 181 us | 382 us | 2.1x |
| 1x64x32x32 | 64x64 | 108 us | 309 us | 2.8x |

Bicubic

| Input | Output | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x3x64x64 | 128x128 | 159 us | 239 us | 1.5x |
| 1x3x32x32 | 128x128 | 160 us | 232 us | 1.4x |
| 1x3x256x256 | 128x128 | 159 us | 238 us | 1.5x |
| 8x3x64x64 | 128x128 | 894 us | 994 us | 1.1x |
| 1x64x32x32 | 64x64 | 616 us | 707 us | 1.1x |

Grid Sample 2D

| Input | Grid | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- | --- |
| 1x3x32x32 | 32x32 | 16.7 us | 87.6 us | 5.2x | 12.5 KB | 12.3 KB |
| 1x3x64x64 | 64x64 | 66.9 us | 127 us | 1.9x | 49.4 KB | 49.2 KB |
| 4x3x32x32 | 32x32 | 60.7 us | 152 us | 2.5x | 49.4 KB | 49.2 KB |
| 1x16x64x64 | 64x64 | 291 us | 223 us | 0.77x | 262 KB | 262 KB |

Cross Product & Unfold

Cross Product

| Shape | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 1Kx3 | 31.1 us | 43.5 us | 1.4x |
| 64Kx3 | 1.88 ms | 2.78 ms | 1.5x |
| 256Kx3 | 7.58 ms | 11.1 ms | 1.5x |
| 64x3x64 | 141 us | 290 us | 2.1x |

Unfold (1D)

| Input | Window | Step | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- | --- |
| 1K | 8 | 1 | 62.2 ns | 120 us | ~1927x |
| 64K | 8 | 1 | 64.1 ns | 7.55 ms | ~117686x |
| 64K | 64 | 1 | 62.2 ns | 8.04 ms | ~129295x |
| 64K | 64 | 32 | 62.2 ns | 254 us | ~4094x |

Unfold (2D/3D)

| Shape | Dim | Window | Step | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| 256x256 | 1 | 8 | 1 | 68.7 ns | 871 us | ~12685x |
| 256x256 | 1 | 32 | 16 | 62.5 ns | 57.7 us | ~924x |
| 1024x256 | 1 | 8 | 1 | 62.8 ns | 3.28 ms | ~52182x |
| 32x64x64 | 2 | 8 | 4 | 77.1 ns | 424 us | ~5497x |

Deformable Convolutions

Small/Tiny Inputs

| Input | Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x3x8x8 | 3 to 8, k3 | 8.72 us | 92.6 us | 11x |
| 1x3x8x8 | no mask | 7.98 us | 78.9 us | 9.9x |
| 1x3x16x16 | 3 to 16, k3 | 36.4 us | 122 us | 3.4x |
| 1x3x16x16 | stride 2 | 10.2 us | 80.8 us | 7.9x |
| 2x8x16x16 | 8 to 16, k3 | 116 us | 246 us | 2.1x |

Medium Inputs

| Input | Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- | --- |
| 1x16x32x32 | 16 to 32, k3 | 826 us | 581 us | 0.70x |
| 1x16x32x32 | wg=4 | 840 us | 528 us | 0.63x |
| 1x16x32x32 | og=4 | 942 us | 606 us | 0.64x |

Attention (Scaled Dot-Product)

Flex auto-selects between two gemm-backed strategies:

- Naive (score matrix <= 256K elements): materializes the full [seq_q, seq_kv] score matrix. Two large gemm calls per (batch, head) amortize dispatch overhead better than many small tiled calls.
- Flash (score matrix > 256K elements): tiles over the KV dimension with online softmax, using O(seq_q * TILE_KV) memory per head instead of O(seq_q * seq_kv).

Both fuse scale + softcap + masking + bias + softmax into a single pass, reducing intermediate allocations from ~12 (NdArray fallback) to 3.
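
The dispatch rule described above can be sketched as follows (the 256K threshold and strategy names come from the text; the code itself is illustrative, not the actual burn-flex implementation):

```rust
// Hypothetical sketch of the naive-vs-flash strategy selection.
#[derive(Debug, PartialEq)]
enum AttnStrategy {
    Naive, // materialize the full [seq_q, seq_kv] score matrix
    Flash, // tile over the KV dimension with online softmax
}

// Score matrices up to 256K elements stay on the naive path.
const NAIVE_MAX_SCORE_ELEMS: usize = 256 * 1024;

fn pick_strategy(seq_q: usize, seq_kv: usize) -> AttnStrategy {
    if seq_q * seq_kv <= NAIVE_MAX_SCORE_ELEMS {
        AttnStrategy::Naive
    } else {
        AttnStrategy::Flash
    }
}

fn main() {
    // s512 self-attention: 512 * 512 = 256K elements, still the naive path.
    assert_eq!(pick_strategy(512, 512), AttnStrategy::Naive);
    // Anything larger tiles over KV with online softmax.
    assert_eq!(pick_strategy(1024, 512), AttnStrategy::Flash);
}
```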

Self-Attention

| Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| h8, s64, d64 | 180 us | 534 us | 3.0x |
| h12, s128, d64 | 989 us | 1.61 ms | 1.6x |
| h12, s256, d64 | 3.79 ms | 6.03 ms | 1.6x |
| h12, s512, d64 | 14.8 ms | 22.7 ms | 1.5x |
| h32, s256, d128 | 15.1 ms | 17.5 ms | 1.2x |
| b4, h12, s128 | 3.96 ms | 5.47 ms | 1.4x |

Causal Attention

| Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| h12, s128, d64 | 1.01 ms | 1.72 ms | 1.7x |
| h12, s256, d64 | 3.82 ms | 6.47 ms | 1.7x |
| h12, s512, d64 | 14.8 ms | 23.8 ms | 1.6x |

With Additive Bias (ALiBi-style)

| Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| h12, s128, d64 | 1.04 ms | 1.60 ms | 1.5x |
| h12, s256, d64 | 4.00 ms | 6.15 ms | 1.5x |

Cross-Attention (seq_q != seq_k)

| Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| sq128, sk512, d64 | 3.82 ms | 6.14 ms | 1.6x |
| sq32, sk1024, d64 | 2.05 ms | 3.72 ms | 1.8x |

Quantized Tensor Operations

All quantized ops (except layout ops) go through a dequantize-op-quantize cycle. Flex stores scales separately and applies scale * x_q directly; NdArray reparses QuantizedBytes on every dequantize call, which dominates the cost.
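
The scale-stored-separately idea can be sketched as follows (a minimal illustration assuming per-tensor symmetric i8 quantization; the struct and function names are hypothetical, not burn-flex's actual types):

```rust
// Hypothetical quantized tensor: i8 payload plus a separately stored scale.
struct QTensor {
    values: Vec<i8>,
    scale: f32,
}

// With the scale already at hand, dequantize is one multiply per element,
// with no per-call reparsing of a packed byte format.
fn dequantize(q: &QTensor) -> Vec<f32> {
    q.values.iter().map(|&x| x as f32 * q.scale).collect()
}

fn main() {
    let q = QTensor { values: vec![-128, 0, 64, 127], scale: 0.5 };
    assert_eq!(dequantize(&q), vec![-64.0, 0.0, 32.0, 63.5]);
}
```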

Quantize (float to i8)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 4K | 6.93 us | 10.4 us | 1.5x | 20.6 KB | 24.7 KB |
| 64K | 109 us | 145 us | 1.3x | 328 KB | 393 KB |
| 1M | 1.75 ms | 2.31 ms | 1.3x | 5.24 MB | 6.29 MB |

Dequantize (i8 to float)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 4K | 399 ns | 48.8 us | ~122x | 16.5 KB | 24.6 KB |
| 64K | 3.73 us | 801 us | ~215x | 262 KB | 393 KB |
| 1M | 54.6 us | 13.0 ms | ~238x | 4.19 MB | 6.29 MB |

q_add (dequant + add + requant)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 4K | 1.15 us | 101 us | 88x | 20.6 KB | 41.0 KB |
| 64K | 14.3 us | 1.61 ms | ~113x | 524 KB | 655 KB |
| 1M | 208 us | 25.9 ms | ~125x | 8.39 MB | 10.5 MB |

q_matmul (dequant + matmul + requant)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 64x64 | 7.10 us | 137 us | 19x | 66.5 KB | 49.3 KB |
| 256x256 | 147 us | 1.93 ms | 13x | 1.05 MB | 788 KB |
| 512x512 | 641 us | 8.07 ms | 13x | 4.19 MB | 3.15 MB |

q_sum (dequant + sum)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 4K | 605 ns | 50.7 us | 84x | 2.13 KB | 24.6 KB |
| 64K | 7.71 us | 834 us | ~108x | 262 KB | 393 KB |
| 1M | 194 us | 13.3 ms | 68x | 4.19 MB | 6.29 MB |

q_permute (zero-copy layout op)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 256x256 | 75.5 ns | 66.1 ns | 0.88x | 20.5 B | 4.00 B |
| 1024x1024 | 77.5 ns | 66.1 ns | 0.85x | 20.5 B | 4.00 B |

q_argmax (operates on i8 directly)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 256x256 | 67.9 us | 106 us | 1.6x | 3.25 KB | 4.16 KB |
| 1024x1024 | 137 us | 1.71 ms | 12x | 12.5 KB | 16.4 KB |

q_argmin (operates on i8 directly)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 256x256 | 69.4 us | 106 us | 1.5x | 3.25 KB | 4.16 KB |
| 1024x1024 | 139 us | 1.72 ms | 12x | 12.5 KB | 16.4 KB |

q_gather (operates on i8 directly for tensor-level quant)

| Size | Flex | NdArray | Speedup | Flex Mem | NdArray Mem |
| --- | --- | --- | --- | --- | --- |
| 256x256 | 65.7 us | 155 us | 2.4x | 590 KB | 721 KB |
| 1024x1024 | 393 us | 2.40 ms | 6.1x | 9.44 MB | 11.5 MB |

Default Ops (sort, repeat, creation, embedding, predicates)

These ops override burn's default trait implementations with direct storage operations.
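
The override pattern can be sketched with a toy trait (hypothetical names; burn's actual ops traits are richer): the trait supplies a default built from other ops, and a backend replaces it with a direct implementation over its own storage.

```rust
// Hypothetical trait with a composed default implementation.
trait Ops {
    fn full(len: usize, value: f32) -> Vec<f32>;

    // Default: express `ones` in terms of `full`, the way a generic
    // fallback composed from other ops would.
    fn ones(len: usize) -> Vec<f32> {
        Self::full(len, 1.0)
    }
}

struct Flex;

impl Ops for Flex {
    fn full(len: usize, value: f32) -> Vec<f32> {
        vec![value; len]
    }

    // Override the default with a direct storage operation.
    fn ones(len: usize) -> Vec<f32> {
        vec![1.0; len]
    }
}

fn main() {
    assert_eq!(Flex::ones(3), vec![1.0, 1.0, 1.0]);
    assert_eq!(Flex::full(2, 7.0), vec![7.0, 7.0]);
}
```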

Sort (f32, 1D)

| Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 4K | 53.6 us | 123 us | 2.3x |
| 64K | 593 us | 1.59 ms | 2.7x |
| 1M | 8.47 ms | 23.9 ms | 2.8x |

Sort (f32, 2D along last dim)

| Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 64x64 | 5.72 us | 70.3 us | 12x |
| 256x256 | 200 us | 1.26 ms | 6.3x |
| 1024x1024 | 1.17 ms | 34.5 ms | 29x |

Argsort (f32, 1D)

| Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 4K | 70.2 us | 123 us | 1.7x |
| 1M | 12.8 ms | 26.2 ms | 2.1x |

Repeat Dim (f32, 256x256)

| Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| dim0 4x | 12.7 us | 134 us | 11x |
| dim1 4x | 11.5 us | 138 us | 12x |
| dim0 8x (512x512) | 98.4 us | 880 us | 8.9x |

Tensor Creation (f32, 1M elements)

| Operation | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| zeros | 17.7 us | 579 us | 33x |
| ones | 35.7 us | 579 us | 16x |
| full | 35.9 us | 579 us | 16x |

Arange (i64)

| Size | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 4K | 1.21 us | 1.21 us | ~1x |
| 1M | 299 us | 280 us | 0.94x |

Embedding (f32)

| Config | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| 30k vocab, d=512, 8x128 | 26.1 us | 150 us | 5.8x |
| 50k vocab, d=768, 4x256 | 38.9 us | 188 us | 4.8x |

Predicates (f32, 1M elements)

| Operation | Flex | NdArray | Speedup |
| --- | --- | --- | --- |
| is_nan | 46.4 us | 73.6 us | 1.6x |
| is_inf | 52.6 us | 147 us | 2.8x |

FFT (Real FFT)

Compared against realfft v3 (backed by rustfft v6), the gold-standard pure-Rust FFT library; NdArray does not implement rfft. Note that realfft requires std, while Flex works in no_std. Here Ratio = Flex median / realfft median, so lower is better for Flex.

1D rfft

| Size | Flex (median) | realfft (median) | Ratio |
| --- | --- | --- | --- |
| n=256 | 841 ns | 252 ns | 3.3x |
| n=1024 | 2.64 us | 991 ns | 2.7x |
| n=4096 | 10.6 us | 4.37 us | 2.4x |
| n=16384 | 45.4 us | 20.7 us | 2.2x |
| n=65536 | 231 us | 91.2 us | 2.5x |

Batched 2D rfft (along last dim)

| Size | Flex (median) | realfft (median) | Ratio |
| --- | --- | --- | --- |
| 16 x 1024 | 59.9 us | 15.9 us | 3.8x |
| 64 x 1024 | 114 us | 64.0 us | 1.8x |
| 256 x 256 | 191 us | 64.9 us | 2.9x |

1D irfft (inverse)

| Size | Flex (median) | realfft (median) | Ratio |
| --- | --- | --- | --- |
| n=256 | 1.32 us | 207 ns | 6.4x |
| n=1024 | 4.01 us | 908 ns | 4.4x |
| n=4096 | 15.6 us | 4.24 us | 3.7x |
| n=16384 | 63.2 us | 21.1 us | 3.0x |
| n=65536 | 278 us | 93.0 us | 3.0x |

Flex implementation: Cooley-Tukey with mixed radix-4/radix-2, complex packing (forward) / inverse packing (inverse), compile-time twiddle tables via const fn, SIMD vectorization via macerator, unrolled small kernels (N=2,4,8), and rayon parallelism across fibers. The remaining gap to rustfft is due to their hand-tuned per-arch SIMD rewrites, split-radix algorithms, and strength-reduced modular arithmetic.
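
As a minimal illustration of the Cooley-Tukey recursion named above, here is a textbook radix-2 decimation-in-time FFT. The real implementation adds radix-4 stages, real-input packing, compile-time twiddle tables, SIMD, and parallelism; none of that is shown here.

```rust
// Textbook radix-2 Cooley-Tukey FFT (illustrative only).
#[derive(Clone, Copy, Debug)]
struct C {
    re: f64,
    im: f64,
}

impl C {
    fn mul(self, o: C) -> C {
        C { re: self.re * o.re - self.im * o.im, im: self.re * o.im + self.im * o.re }
    }
    fn add(self, o: C) -> C {
        C { re: self.re + o.re, im: self.im + o.im }
    }
    fn sub(self, o: C) -> C {
        C { re: self.re - o.re, im: self.im - o.im }
    }
}

fn fft(x: &[C]) -> Vec<C> {
    let n = x.len();
    assert!(n.is_power_of_two());
    if n == 1 {
        return vec![x[0]];
    }
    // Decimation in time: split into even/odd index halves and recurse.
    let even: Vec<C> = x.iter().step_by(2).copied().collect();
    let odd: Vec<C> = x.iter().skip(1).step_by(2).copied().collect();
    let (e, o) = (fft(&even), fft(&odd));
    let mut out = vec![C { re: 0.0, im: 0.0 }; n];
    for k in 0..n / 2 {
        // Twiddle factor w = exp(-2*pi*i*k/n); a const table in the real code.
        let ang = -2.0 * std::f64::consts::PI * k as f64 / n as f64;
        let w = C { re: ang.cos(), im: ang.sin() };
        let t = w.mul(o[k]);
        out[k] = e[k].add(t);
        out[k + n / 2] = e[k].sub(t);
    }
    out
}

fn main() {
    // The DFT of a constant signal concentrates all energy in bin 0.
    let x: Vec<C> = (0..8).map(|_| C { re: 1.0, im: 0.0 }).collect();
    let y = fft(&x);
    assert!((y[0].re - 8.0).abs() < 1e-9);
    assert!(y.iter().skip(1).all(|c| c.re.abs() < 1e-9 && c.im.abs() < 1e-9));
}
```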


Running Benchmarks

```bash
cargo bench --bench attention
cargo bench --bench binary_ops
cargo bench --bench matmul
cargo bench --bench int_ops
cargo bench --bench slice_ops
cargo bench --bench reduce_ops
cargo bench --bench cumulative_ops
cargo bench --bench gather_scatter_ops
cargo bench --bench unary_ops
cargo bench --bench comparison_ops
cargo bench --bench conv_ops
cargo bench --bench pool_ops
cargo bench --bench conv_transpose_ops
cargo bench --bench interpolate_ops
cargo bench --bench cross_unfold_ops
cargo bench --bench deform_conv_ops
cargo bench --bench quantization_ops
cargo bench --bench cat_max_min_ops
cargo bench --bench default_ops
cargo bench --bench fft_ops
```