Back to Wazuh

FastQueue Benchmarks

src/engine/source/fastqueue/benchmark/README.md

4.14.413.1 KB
Original Source

FastQueue Benchmarks

Performance benchmarks comparing CQueue (lock-free) and StdQueue (mutex-based) implementations.

Available Benchmarks

1. fastqueue_benchmark (CQueue Only)

File: cqueue_bench.cpp

Comprehensive benchmarks for the lock-free CQueue implementation.

2. fastqueue_stdqueue_benchmark (StdQueue Only)

File: stdqueue_bench.cpp

Identical benchmarks for the mutex-based StdQueue implementation.

3. fastqueue_comparison_benchmark (Direct Comparison)

File: comparison_bench.cpp

Recommended - Head-to-head comparison of both implementations with paired tests.

Important Notes

Minimum Queue Capacity: Both implementations require 8,192 elements (MIN_QUEUE_CAPACITY).

  • This is 2x BLOCK_SIZE (4096) for optimal CQueue performance
  • Attempting to create smaller queues will throw std::runtime_error
  • Recommended sizes: 2^17 (131K), 2^20 (1M)

Benchmark Categories

Individual benchmarks (CQueue and StdQueue):

These test all SPSC, MPSC, SPMC, MPMC, Bulk, and High Contention scenarios.

Comparison benchmark:

Direct head-to-head tests:

1. Single Producer, Single Consumer (SPSC)

  • Tests basic push/pop performance with different queue sizes
  • Queue sizes: 2^17 (131K items: 1K, 10K, 50K)
  • Use case: Sequential processing pipeline

2. Multiple Producers, Single Consumer (MPSC)

  • Multiple threads pushing, one consuming
  • Producers: 2, 4, 8 threads
  • Queue size: 2^17 (131K)
  • Items: 1000 per producer
  • Use case: Log aggregation, event collection from multiple sources

3. Single Producer, Multiple Consumers (SPMC)

  • One thread pushing, multiple consuming
  • Consumers: 2, 4, 8 threads
  • Queue size: 2^17 (131K)
  • Items: 10,000 total
  • Use case: Work distribution, fan-out patterns, parallel processing

4. Multiple Producers, Multiple Consumers (MPMC)

  • Concurrent producers and consumers
  • Configurations: 2x2, 4x4, 8x8, 2x4, 4x8, 8x16
  • Queue sizes: 2^17, 2^20
  • Use case: High-throughput message passing

5. Bulk Operations

  • Tests bulk dequeue performance
  • Bulk sizes: 1, 10, 100, 1000
  • Queue sizes: 2^17, 2^20
  • Use case: Batch processing

6. High Contention

  • Mixed push/pop operations under high contention
  • Threads: 2, 4, 8, 16
  • Use case: Stress testing, worst-case scenarios

7. Rate Limiting (Comparison Benchmark Only)

  • Tests rate limiter overhead and accuracy
  • Rates: 1K, 10K elements/second
  • Use case: Backpressure control, throttling

Running Benchmarks

CQueue (Lock-Free) Performance Targets

ScenarioTarget ThroughputNotes
SPSC40M+ ops/secGood baseline, StdQueue is faster here
MPSC (8 producers)9M+ ops/secStrong advantage over StdQueue
SPMC (8 consumers)2.5M+ ops/secBest scenario - 4.4x faster than StdQueue
MPMC (4x4)24M+ ops/secHigh throughput under balance load
Bulk (size 10)26M+ ops/secExcellent batch efficiency
High Contention (16t)200M+ ops/secMaintains throughput under stress

StdQueue (Mutex-Based) Performance Targets

ScenarioTarget ThroughputNotes
SPSC45M+ ops/secBest scenario - simpler is faster
MPSC (8 producers)3M+ ops/secStruggles with contention
SPMC (8 consumers)650k+ ops/secPoor scaling with consumers
MPMC (4x4)23M+ ops/secCompetitive under balanced load
Rate Limiting (10K/s)900k+ ops/secBest scenario - condition_variable shines
High Contention (16t)175M+ ops/secGood but lower than CQueue

Configuration Tuning

  1. BLOCK_SIZE (currently 4096):

    • Increase for higher latency, better throughput
    • Decrease for lower latency, more allocations
    • Current value is optimal for most workloads
  2. IMPLICIT_INITIAL_INDEX_SIZE (currently 512):

    • Increase for better MPMC scaling (more concurrent producers/consumers)
    • Current value handles 8-16 threads well
  3. MIN_QUEUE_CAPACITY (8192):

    • Do not change - required minimum for performance
    • Use 2^17 (131K) or 2^20 (1M) for production

Benchmark Results

╰─#  $ENGINE_BUILD/source/fastqueue/fastqueue_comparison_benchmark
2026-02-16T21:30:07+00:00
Running /workspaces/wazuh-5.x/wazuh/src/build/engine/source/fastqueue/fastqueue_comparison_benchmark
Run on (32 X 5600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
Load Average: 2.25, 1.90, 2.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------
BM_Compare_SPSC_CQueue/1000                 51.6 us         51.4 us        12929 items_per_second=38.8809M/s
BM_Compare_SPSC_CQueue/10000                 512 us          510 us         1371 items_per_second=39.1879M/s
BM_Compare_SPSC_CQueue/50000                2559 us         2551 us          272 items_per_second=39.2049M/s
BM_Compare_SPSC_StdQueue/1000               44.1 us         43.9 us        15940 items_per_second=45.5183M/s
BM_Compare_SPSC_StdQueue/10000               443 us          442 us         1606 items_per_second=45.2999M/s
BM_Compare_SPSC_StdQueue/50000              2276 us         2269 us          311 items_per_second=44.0725M/s
BM_Compare_MPMC_CQueue                      1059 us          167 us         3991 items_per_second=23.8888M/s
BM_Compare_MPMC_StdQueue                    2761 us          181 us         1000 items_per_second=22.0425M/s
BM_Compare_MPSC_CQueue/2                     314 us          311 us         2226 items_per_second=6.43011M/s
BM_Compare_MPSC_CQueue/4                     464 us          463 us         1499 items_per_second=8.63377M/s
BM_Compare_MPSC_CQueue/8                     862 us          857 us          841 items_per_second=9.33645M/s
BM_Compare_MPSC_StdQueue/2                   842 us          708 us          964 items_per_second=2.82319M/s
BM_Compare_MPSC_StdQueue/4                  1499 us         1236 us          561 items_per_second=3.23707M/s
BM_Compare_MPSC_StdQueue/8                  2741 us         2410 us          288 items_per_second=3.31911M/s
BM_Compare_SPMC_CQueue/2                    3260 us         3138 us          224 items_per_second=3.18679M/s
BM_Compare_SPMC_CQueue/4                    3350 us         3265 us          213 items_per_second=3.06301M/s
BM_Compare_SPMC_CQueue/8                    3540 us         3497 us          203 items_per_second=2.8595M/s
BM_Compare_SPMC_StdQueue/2                  4721 us         3726 us          187 items_per_second=2.68414M/s
BM_Compare_SPMC_StdQueue/4                  9911 us         8231 us           86 items_per_second=1.21494M/s
BM_Compare_SPMC_StdQueue/8                 16969 us        15364 us           45 items_per_second=650.855k/s
BM_Compare_Bulk_CQueue/1                     663 us          660 us         1059 items_per_second=15.1425M/s
BM_Compare_Bulk_CQueue/10                    386 us          385 us         1816 items_per_second=25.956M/s
BM_Compare_Bulk_CQueue/100                   409 us          408 us         1718 items_per_second=24.5227M/s
BM_Compare_Bulk_StdQueue/1                   587 us          586 us         1191 items_per_second=17.0578M/s
BM_Compare_Bulk_StdQueue/10                  466 us          465 us         1505 items_per_second=21.4906M/s
BM_Compare_Bulk_StdQueue/100                 450 us          449 us         1553 items_per_second=22.2909M/s
BM_Compare_RateLimit_CQueue/1000           90000 us         1662 us          100 items_per_second=60.1722k/s
BM_Compare_RateLimit_CQueue/10000           9000 us          115 us         1000 items_per_second=866.421k/s
BM_Compare_RateLimit_StdQueue/1000         90001 us          943 us          100 items_per_second=105.998k/s
BM_Compare_RateLimit_StdQueue/10000         9000 us          105 us         1000 items_per_second=952.396k/s
BM_Compare_HighContention_CQueue/4          2001 us         76.7 us         9592 items_per_second=260.706M/s
BM_Compare_HighContention_CQueue/8          3826 us          205 us         1000 items_per_second=194.772M/s
BM_Compare_HighContention_CQueue/16         6730 us          402 us         1000 items_per_second=198.798M/s
BM_Compare_HighContention_StdQueue/4        3561 us         79.8 us         1000 items_per_second=250.588M/s
BM_Compare_HighContention_StdQueue/8       10086 us          236 us         1000 items_per_second=169.72M/s
BM_Compare_HighContention_StdQueue/16      21036 us          458 us         1000 items_per_second=174.831M/s

Actual Performance Results (Release Build)

Based on real benchmark results on 32-core system:

Performance Summary Table

ScenarioCQueueStdQueueWinnerAdvantage
SPSC41.2 M/s46.1 M/sStdQueue+12%
MPSC (8 prod)9.34 M/s3.32 M/sCQueue+181%
SPMC (8 cons)2.86 M/s651 k/sCQueue+339%
MPMC (4x4)24.3 M/s23.1 M/sCQueue+5%
Bulk (10)26.0 M/s21.5 M/sCQueue+21%
Rate Limit (10K/s)500 k/s942 k/sStdQueue+88%
High Contention (16t)202 M/s176 M/sCQueue+15%

CQueue (lock-free) strengths:

  • MPSC dominance: 2.8x faster with multiple producers
  • SPMC excellence: Up to 4.4x faster with multiple consumers (best scenario!)
  • Better scaling under MPMC and high contention
  • Superior bulk operations performance
  • Lock-free guarantees, no mutex contention

StdQueue (mutex-based) strengths:

  • SPSC efficiency: 12% faster in simple producer-consumer
  • Rate limiting: 88% faster when throttling is required
  • Exact capacity control (no overshoot)
  • Simpler implementation, easier to debug
  • Lower memory overhead for small queues

Key Performance Insights

  1. SPSC (simple): StdQueue is 12% faster (mutex has low overhead without contention)
  2. MPSC (multiple producers): CQueue is 2.3-2.8x faster and scales better with more producers
  3. SPMC (multiple consumers): CQueue is 2.5-4.4x faster - this is CQueue's best scenario!
  4. MPMC (balanced): CQueue is slightly faster (~5-8%)
  5. Rate limiting: StdQueue is dramatically faster (44-88%) - condition_variable integrates naturally
  6. Bulk operations: CQueue is 10-21% faster, especially with medium bulk sizes
  7. High contention: CQueue maintains better throughput (+15% with 16 threads)

Real Benchmark Results (Release Build)

# SPSC - StdQueue wins in simple scenarios
BM_Compare_SPSC_CQueue/10000          486 us          items_per_second=41.1M/s
BM_Compare_SPSC_StdQueue/10000        436 us          items_per_second=45.9M/s  ✓ 12% faster

# MPSC - CQueue dominates with multiple producers
BM_Compare_MPSC_CQueue/8              857 us          items_per_second=9.34M/s  ✓ 2.8x faster
BM_Compare_MPSC_StdQueue/8           2410 us          items_per_second=3.32M/s

# SPMC - CQueue's best scenario! (multiple consumers)
BM_Compare_SPMC_CQueue/8             3497 us          items_per_second=2.86M/s  ✓ 4.4x faster
BM_Compare_SPMC_StdQueue/8          15364 us          items_per_second=651k/s

# MPMC - CQueue maintains advantage
BM_Compare_MPMC_CQueue                167 us          items_per_second=24.3M/s  ✓ 5% faster
BM_Compare_MPMC_StdQueue              181 us          items_per_second=23.1M/s

# Rate Limiting - StdQueue excels
BM_Compare_RateLimit_CQueue/10000     200 us          items_per_second=500k/s
BM_Compare_RateLimit_StdQueue/10000   106 us          items_per_second=942k/s   ✓ 88% faster

Key Takeaways:

  • StdQueue wins: SPSC (+12%), Rate limiting (+88%)
  • CQueue dominates: MPSC (+181%), SPMC (+339%), MPMC (+5%), Bulk (+21%), High contention (+15%)

When to Choose Each Implementation

Use CQueue (lock-free) when:

  • Multiple producers → single consumer (MPSC) - 2.8x faster
  • Single producer → multiple consumers (SPMC) - 4.4x faster
  • Multiple producers and multiple consumers (MPMC)
  • Bulk operations are frequent
  • High thread contention (8+ threads)
  • Maximum throughput is critical
  • Lock-free guarantees are required
  • NO rate limiting or rate limiting is not performance-critical

Use StdQueue (mutex-based) when:

  • Rate limiting is required - 88% faster due to condition_variable integration
  • Simple SPSC pattern - 12% faster without contention overhead
  • Exact capacity control is critical (no overshoot)
  • Low contention scenarios (1-2 threads)
  • Simpler debugging and maintenance is preferred
  • Memory overhead is a concern
  • Mutex-based synchronization is acceptable/required

Critical Decision Point:

  • If you need rate limiting: Choose StdQueue (nearly 2x faster)
  • If you have asymmetric patterns (MPSC/SPMC): Choose CQueue (2-4x faster)
  • If you have simple SPSC: Choose StdQueue (12% faster)
  • If you have balanced MPMC or high contention: Choose CQueue (5-15% faster)