Performance Benchmarks

optional-skills/mlops/flash-attention/references/benchmarks.md

Contents

  • Speed comparisons across GPUs
  • Memory usage analysis
  • Scaling with sequence length
  • Training vs inference performance
  • Flash Attention versions comparison

Speed comparisons across GPUs

A100 80GB (Ampere)

Forward pass time (milliseconds, batch=8, heads=32, dim=64):

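| Seq Length | Standard | Flash Attn 2 | Flash Attn 3 | Speedup (FA2) |
|---|---|---|---|---|
| 512 | 1.2 | 0.9 | N/A | 1.3x |
| 1024 | 3.8 | 1.4 | N/A | 2.7x |
| 2048 | 14.2 | 4.8 | N/A | 3.0x |
| 4096 | 55.1 | 17.3 | N/A | 3.2x |
| 8192 | 218.5 | 66.2 | N/A | 3.3x |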
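
Forward-pass timings like these are usually collected with a small harness that discards warm-up runs and waits for the GPU before stopping the clock. The sketch below is one way to do that, not the harness behind these numbers. `time_forward_ms` and `dummy_forward` are hypothetical names, and the workload is a CPU stand-in so the snippet runs anywhere.

```python
import statistics
import time

def time_forward_ms(fn, warmup=3, iters=10, sync=None):
    """Return the median wall-clock time of fn() in milliseconds.

    sync is an optional callable that blocks until queued GPU work has
    finished (with PyTorch on CUDA that would be torch.cuda.synchronize).
    Without it, asynchronous kernel launches make the timings meaningless.
    """
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# CPU-only stand-in workload (an assumption, not an attention kernel).
def dummy_forward():
    return sum(i * i for i in range(10_000))

elapsed = time_forward_ms(dummy_forward)
print(f"median forward time: {elapsed:.3f} ms")
```

To time a real attention call, pass a closure over the kernel as `fn` and the framework's synchronize function as `sync`.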

H100 80GB (Hopper)

Forward pass time (milliseconds, same config as the A100 table: batch=8, heads=32, dim=64):

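| Seq Length | Standard | Flash Attn 2 | Flash Attn 3 (FP16) | Flash Attn 3 (FP8) | Best Speedup |
|---|---|---|---|---|---|
| 512 | 0.8 | 0.6 | 0.4 | 0.3 | 2.7x |
| 1024 | 2.6 | 1.0 | 0.6 | 0.4 | 6.5x |
| 2048 | 9.8 | 3.4 | 2.0 | 1.3 | 7.5x |
| 4096 | 38.2 | 12.5 | 7.2 | 4.8 | 8.0x |
| 8192 | 151.4 | 47.8 | 27.1 | 18.2 | 8.3x |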
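
The speedup columns are the standard-attention time divided by the time of the faster kernel, and "Best Speedup" uses the fastest of the Flash Attention variants. A minimal sketch of that arithmetic, with hypothetical helper names and one row copied from the table above:

```python
def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms

def best_speedup(baseline_ms, candidates_ms):
    """Speedup of the fastest candidate over the baseline."""
    return speedup(baseline_ms, min(candidates_ms))

# H100 row for sequence length 8192:
# standard, FA2, FA3 (FP16), FA3 (FP8), all in milliseconds.
standard, fa2, fa3_fp16, fa3_fp8 = 151.4, 47.8, 27.1, 18.2

print(f"FA2 speedup:  {speedup(standard, fa2):.1f}x")
print(f"best speedup: {best_speedup(standard, [fa2, fa3_fp16, fa3_fp8]):.1f}x")
```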

Key insight: Flash Attention 3 on H100 reaches about 740 TFLOPS in FP16 (roughly 75% of theoretical max) and close to 1.2 PFLOPS in FP8.

A10G 24GB (Ampere)

Forward pass time (milliseconds, batch=4):

| Seq Length | Standard | Flash Attn 2 | Speedup |
|---|---|---|---|
| 512 | 2.1 | 1.6 | 1.3x |
| 1024 | 6.8 | 2.8 | 2.4x |
| 2048 | 25.9 | 9.4 | 2.8x |
| 4096 | 102.1 | 35.2 | 2.9x |

Memory usage analysis

GPU memory consumption (batch=8, heads=32, dim=64)

Standard attention memory:

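| Seq Length | Attention Matrix | KV Cache | Total | Notes |
|---|---|---|---|---|
| 512 | 8 MB | 32 MB | 40 MB | Manageable |
| 2048 | 128 MB | 128 MB | 256 MB | Growing |
| 8192 | 2048 MB (2 GB) | 512 MB | 2.5 GB | Large |
| 32768 | 32768 MB (32 GB) | 2048 MB | 34 GB | OOM on 24GB GPUs |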

Flash Attention 2 memory:

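| Seq Length | Attention (on-chip) | KV Cache | Total | Reduction |
|---|---|---|---|---|
| 512 | 0 MB (recomputed) | 32 MB | 32 MB | 20% |
| 2048 | 0 MB | 128 MB | 128 MB | 50% |
| 8192 | 0 MB | 512 MB | 512 MB | 80% |
| 32768 | 0 MB | 2048 MB | 2 GB | 94% |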
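
The "Reduction" column is the fraction of the standard-attention total that Flash Attention 2 no longer needs. The sketch below recomputes it from the totals in the two tables above; `reduction_percent` is a hypothetical helper name.

```python
def reduction_percent(standard_mb, flash_mb):
    """Share of the standard-attention memory that is saved, in percent."""
    return 100.0 * (1.0 - flash_mb / standard_mb)

# (sequence length, standard total in MB, Flash Attention 2 total in MB)
# Standard total = attention matrix + KV cache, as listed above.
rows = [
    (512, 8 + 32, 32),
    (2048, 128 + 128, 128),
    (8192, 2048 + 512, 512),
    (32768, 32768 + 2048, 2048),
]

for seq_len, standard_mb, flash_mb in rows:
    saved = reduction_percent(standard_mb, flash_mb)
    print(f"{seq_len:>6}: {saved:.0f}% less memory")
```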

Key insight: Flash Attention never materializes the full N × N attention matrix in GPU memory, which removes the O(N²) memory term.
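
The reason the matrix never has to be stored is that softmax can be computed in a streaming fashion: keys and values are processed block by block while a running maximum, a running normalizer, and a running weighted sum are kept. The following pure-Python sketch shows that idea for a single query row. It is a simplified illustration under that assumption, not the actual CUDA kernel, and all function names are hypothetical.

```python
import math

def naive_attention_row(q, keys, values):
    """Reference: compute every score for one query, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def streaming_attention_row(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention_row(q, keys, values))
print(streaming_attention_row(q, keys, values))
```

Both functions print the same vector up to floating-point rounding. The streaming version only ever holds `block_size` scores, which is why the working set stays on-chip in the real kernels.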

Memory scaling comparison

Llama 2 7B model memory (float16, batch=1):

| Context Length | Standard Attention | Flash Attention 2 | Can Fit 24GB GPU? |
|---|---|---|---|
| 2K | 3.2 GB | 2.1 GB | Both: Yes |
| 4K | 5.8 GB | 2.8 GB | Both: Yes |
| 8K | 12.1 GB | 4.2 GB | Both: Yes |
| 16K | 26.3 GB (OOM) | 7.8 GB | Only Flash: Yes |
| 32K | OOM | 14.2 GB | Only Flash: Yes |

Training memory (Llama 2 7B, batch=4)

| Context | Standard (GB) | Flash Attn (GB) | Reduction |
|---|---|---|---|
| 2K | 18.2 | 12.4 | 32% |
| 4K | 34.8 | 16.8 | 52% |
| 8K | OOM (>40GB) | 26.2 | Fits! |

Scaling with sequence length

Computational complexity

Standard attention:

  • Time: O(N² × d)
  • Memory: O(N² + N × d)

Flash Attention:

  • Time: O(N² × d) (same FLOP count, but faster in practice because far fewer reads and writes go to GPU high-bandwidth memory)
  • Memory: O(N × d) (linear in sequence length)
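
To get a feel for what the two memory bounds mean, the sketch below counts stored elements per attention head with all constant factors dropped. That is a simplification of this example, and the function names are hypothetical.

```python
def standard_attention_elements(n, d):
    """O(N^2 + N*d): the N x N score matrix plus an N x d tensor."""
    return n * n + n * d

def flash_attention_elements(n, d):
    """O(N*d): only the N x d tensor, since scores are never stored."""
    return n * d

d = 64
for n in (512, 2048, 8192, 32768):
    ratio = standard_attention_elements(n, d) / flash_attention_elements(n, d)
    print(f"N={n:>6}: standard holds {ratio:.0f}x more elements")
```

The ratio simplifies to (N + d) / d, so it grows linearly with N for a fixed head dimension.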

Empirical scaling (A100, batch=1, heads=32, dim=64)

Time per token (milliseconds):

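| Sequence | 512 | 1K | 2K | 4K | 8K | 16K |
|---|---|---|---|---|---|---|
| Standard | 0.15 | 0.37 | 1.11 | 3.44 | 13.4 | 52.8 |
| Flash Attn 2 | 0.11 | 0.14 | 0.24 | 0.43 | 0.83 | 1.64 |
| Speedup | 1.4x | 2.6x | 4.6x | 8.0x | 16.1x | 32.2x |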

Observation: In this table the speedup roughly doubles each time the sequence length doubles, so it grows approximately linearly with sequence length.
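
One way to check how a quantity scales with sequence length is to fit a straight line to it on a log-log plot: the slope is the exponent p in "value ∝ N^p". The sketch below applies an ordinary least-squares fit to the speedup row of the table above; `scaling_exponent` is a hypothetical helper name.

```python
import math

def scaling_exponent(seq_lens, values):
    """Least-squares slope of log(value) against log(sequence length)."""
    xs = [math.log(n) for n in seq_lens]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

seq_lens = [512, 1024, 2048, 4096, 8192, 16384]
speedups = [1.4, 2.6, 4.6, 8.0, 16.1, 32.2]   # from the table above

print(f"speedup grows like N^{scaling_exponent(seq_lens, speedups):.2f}")
```

An exponent near 1 indicates linear growth and an exponent near 2 indicates quadratic growth.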

Memory per token (MB)

| Sequence | 512 | 1K | 2K | 4K | 8K | 16K |
|---|---|---|---|---|---|---|
| Standard | 0.08 | 0.13 | 0.25 | 0.64 | 2.05 | 8.13 |
| Flash Attn 2 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |

Observation: Flash Attention memory per token stays constant, so total memory grows only linearly with sequence length.

Training vs inference performance

Training (forward + backward, Llama 2 7B, A100)

| Batch × Seq | Standard (samples/sec) | Flash Attn (samples/sec) | Speedup |
|---|---|---|---|
| 4 × 2K | 1.2 | 3.1 | 2.6x |
| 8 × 2K | 2.1 | 5.8 | 2.8x |
| 4 × 4K | 0.4 | 1.3 | 3.3x |
| 8 × 4K | OOM | 2.4 | Enabled |
| 2 × 8K | 0.1 | 0.4 | 4.0x |

Inference (generation, Llama 2 7B, A100)

| Context Length | Standard (tokens/sec) | Flash Attn (tokens/sec) | Speedup |
|---|---|---|---|
| 512 | 48 | 52 | 1.1x |
| 2K | 42 | 62 | 1.5x |
| 4K | 31 | 58 | 1.9x |
| 8K | 18 | 51 | 2.8x |
| 16K | OOM | 42 | Enabled |

Note: The inference speedup is less dramatic than the training speedup because token-by-token generation is memory-bound: each step is dominated by reading the KV cache rather than by attention compute.
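
A rough size estimate shows why KV cache traffic dominates generation. The sketch below assumes the publicly documented Llama 2 7B shape (32 layers, 32 attention heads, head dimension 128) and float16 storage. These defaults are assumptions of this example rather than figures taken from the tables above, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, batch=1, layers=32, kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3
for context in (512, 2048, 4096, 8192, 16384):
    size = kv_cache_bytes(context)
    print(f"{context:>6} tokens: {size / GIB:.2f} GiB of KV cache")
```

Every generated token has to read this whole cache once, so the step time follows memory bandwidth more closely than arithmetic throughput.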

Flash Attention versions comparison

Flash Attention 1 vs 2 vs 3 (H100, seq=4096, batch=8)

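| Metric | FA1 | FA2 | FA3 (FP16) | FA3 (FP8) |
|---|---|---|---|---|
| Forward time (ms) | 28.4 | 12.5 | 7.2 | 4.8 |
| Memory (GB) | 4.8 | 4.2 | 4.2 | 2.8 |
| TFLOPS | 180 | 420 | 740 | 1150 |
| GPU util % | 35% | 55% | 75% | 82% |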

Key improvements:

  • FA2: 2.3x faster than FA1 (better parallelism)
  • FA3 (FP16): 1.7x faster than FA2 (H100 async optimizations)
  • FA3 (FP8): 2.6x faster than FA2 (low precision)

Features by version

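| Feature | FA1 | FA2 | FA3 |
|---|---|---|---|
| Basic attention | | | |
| Causal masking | | | |
| Multi-query attention | | | |
| Sliding window | | | |
| Paged KV cache | | | |
| FP8 support | | | ✅ (H100 only) |
| Work partitioning | Basic | Advanced | Optimal |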
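
For readers unfamiliar with two of the features listed above, the sketch below builds the boolean masks that causal masking and sliding-window attention correspond to. It is an illustration only: `attention_mask` is a hypothetical helper, and its `window` convention (maximum distance between query and key positions) is an assumption that may differ from a given library's parameters.

```python
def attention_mask(seq_len, causal=False, window=None):
    """Return mask[i][j] == True where query i may attend to key j.

    causal: keys after the query position are hidden.
    window: maximum allowed distance |i - j| (None means unlimited).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            allowed = True
            if causal and j > i:
                allowed = False
            if window is not None and abs(i - j) > window:
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask

def show(mask):
    """Print the mask with '#' for visible and '.' for hidden positions."""
    for row in mask:
        print("".join("#" if cell else "." for cell in row))

show(attention_mask(6, causal=True))
print()
show(attention_mask(6, causal=True, window=2))
```

The first grid is lower-triangular. The second keeps only a band of width 3 along the diagonal, which is what bounds the work per query in sliding-window attention.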

Real-world model benchmarks

Llama 2 models (A100 80GB, batch=4, seq=2048)

| Model | Params | Standard (samples/sec) | Flash Attn (samples/sec) | Speedup |
|---|---|---|---|---|
| Llama 2 7B | 7B | 1.2 | 3.1 | 2.6x |
| Llama 2 13B | 13B | 0.6 | 1.7 | 2.8x |
| Llama 2 70B | 70B | 0.12 | 0.34 | 2.8x |

GPT-style models (seq=1024)

| Model | Standard (tokens/sec) | Flash Attn (tokens/sec) | Speedup |
|---|---|---|---|
| GPT-2 (124M) | 520 | 680 | 1.3x |
| GPT-J (6B) | 42 | 98 | 2.3x |
| GPT-NeoX (20B) | 8 | 22 | 2.75x |

Recommendations by use case

Training large models (>7B parameters):

  • Use Flash Attention 2 on A100
  • Use Flash Attention 3 FP8 on H100 for maximum speed
  • Expected: 2.5-3x speedup

Long context inference (>4K tokens):

  • Flash Attention essential (enables contexts standard attention can't handle)
  • Expected: 2-4x speedup, 5-10x memory reduction

Short sequences (<512 tokens):

  • Flash Attention provides 1.2-1.5x speedup
  • Minimal memory benefit
  • Still worth enabling (no downside)

Multi-user serving:

  • Flash Attention reduces per-request memory
  • Allows higher concurrent batch sizes
  • Can serve 2-3x more users on same hardware
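
The recommendations above can be folded into a small lookup keyed on the GPU's CUDA compute capability (A100 is 8.0, A10G is 8.6, H100 is 9.0). This is a sketch under assumptions: `recommend_attention` is a hypothetical helper, and treating every 8.x device like the A100 goes beyond what the benchmarks here cover.

```python
def recommend_attention(compute_capability, seq_len):
    """Return (kernel, expectation) following the recommendations above."""
    major, _minor = compute_capability
    if major == 9:        # 9.0: H100
        kernel = "Flash Attention 3 (FP8 for maximum speed)"
    elif major == 8:      # 8.0: A100, 8.6: A10G (assumed to behave alike)
        kernel = "Flash Attention 2"
    else:
        kernel = "not covered by these benchmarks"

    if seq_len < 512:
        note = "short sequences: about 1.2-1.5x speedup, minimal memory benefit"
    elif seq_len > 4096:
        note = "long context: about 2-4x speedup, 5-10x memory reduction"
    else:
        note = "see the tables above for this sequence length"
    return kernel, note

print(recommend_attention((8, 0), 8192))
print(recommend_attention((9, 0), 256))
```

With PyTorch, the capability tuple can be obtained from `torch.cuda.get_device_capability()`.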