Chaos Test — HA Leader Lock Failover

bin/chaos-test/README.md

Standalone binary that continuously injects random faults into a local PoA cluster with Redis-based leader election, checking safety and liveness invariants. Think of it as fuzzing for distributed systems.

Prerequisites

  • redis-server must be on $PATH (for example, brew install redis on macOS)
  • Builds with the rocksdb feature (included automatically)

Quick Start

```bash
# Build
cargo build -p fuel-core-chaos-test

# Run with a specific seed (reproducible)
cargo run -p fuel-core-chaos-test -- --seed 42 --duration 60s

# Run with aggressive fault injection
cargo run -p fuel-core-chaos-test -- --seed 1337 --duration 90s \
  --fault-interval 500ms --block-time 100ms

# Random seed, default settings
cargo run -p fuel-core-chaos-test
```

Exit code 0 = pass, 1 = invariant violations found. The seed is printed at startup for reproduction.

What It Does

  1. Starts N PoA nodes (default 3), M Redis instances (default 3), and a P2P bootstrap relay
  2. Creates a per-(node, redis) TCP proxy grid (N x M proxies, 9 with the defaults) so faults can be injected asymmetrically
  3. Nodes use persistent RocksDB via temp directories so restarts preserve chain state
  4. A fault scheduler randomly injects network partitions, latency, mid-operation TCP drops, Redis kills, and node kills, then recovers automatically
  5. An invariant checker continuously monitors for violations
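
To illustrate why a per-(node, redis) grid allows asymmetric faults, here is a minimal std-only sketch. The mode names mirror those listed for proxy.rs, but the `ProxyGrid` type, its fields, and its methods are hypothetical and only model the bookkeeping, not the actual TCP forwarding.

```rust
/// Fault mode of one proxy. Variant names follow the modes listed for proxy.rs;
/// the payload fields are assumptions made for this sketch.
#[allow(dead_code)] // payloads are only stored here, never acted on
#[derive(Clone, Debug, PartialEq)]
enum ProxyMode {
    Normal,
    DropAll,
    Latency { millis: u64 },
    CloseAfterBytes { limit: usize },
}

/// Hypothetical grid: modes[node][redis] is the proxy between that pair.
struct ProxyGrid {
    modes: Vec<Vec<ProxyMode>>,
}

impl ProxyGrid {
    fn new(nodes: usize, redis: usize) -> Self {
        Self { modes: vec![vec![ProxyMode::Normal; redis]; nodes] }
    }

    /// Total number of proxies (N x M).
    fn proxy_count(&self) -> usize {
        self.modes.iter().map(Vec::len).sum()
    }

    /// Cut one node off from every Redis instance; other nodes are unaffected.
    fn partition_node(&mut self, node: usize) {
        for mode in &mut self.modes[node] {
            *mode = ProxyMode::DropAll;
        }
    }

    /// Number of Redis instances a node can still reach.
    fn reachable(&self, node: usize) -> usize {
        self.modes[node]
            .iter()
            .filter(|mode| **mode != ProxyMode::DropAll)
            .count()
    }
}

fn main() {
    let mut grid = ProxyGrid::new(3, 3);
    grid.partition_node(0);
    grid.modes[1][2] = ProxyMode::Latency { millis: 250 };
    grid.modes[2][0] = ProxyMode::CloseAfterBytes { limit: 64 };
    println!("proxies: {}", grid.proxy_count());
    for node in 0..3 {
        println!("node {} reaches {} Redis instances", node, grid.reachable(node));
    }
}
```

Because each (node, redis) pair has its own proxy, a fault on one cell leaves every other path untouched, which is what makes one-sided partitions possible.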

Invariants Checked

| Invariant | Method | Severity |
|---|---|---|
| No forks | Block stream: compare block IDs at each height across nodes | Critical |
| No concurrent leaders | Block stream: detect multiple nodes locally producing at same height | Critical |
| No gaps | DB polling: on_chain().latest_height() per node vs global max | Soft (60s tolerance) |
| No production stalls | DB polling: global max height must advance within --stall-threshold | Soft (default 6s) |

Gap and stall detection use direct RocksDB reads, not the block broadcast stream (which silently drops events when the consumer lags).
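
The fork and stall checks can be sketched roughly as follows. This is not the code in invariants.rs: `ForkDetector`, `StallDetector`, and their methods are hypothetical names, block IDs are modelled as strings, and timestamps are passed in as durations since test start so the logic stays deterministic.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Remembers the first block ID reported at each height and flags mismatches.
#[derive(Default)]
struct ForkDetector {
    // height -> (node that reported first, block ID)
    first_seen: HashMap<u32, (usize, String)>,
}

impl ForkDetector {
    /// Returns a description of the fork if `node` reports a different ID
    /// than the one already recorded at `height`.
    fn observe(&mut self, node: usize, height: u32, block_id: &str) -> Option<String> {
        match self.first_seen.get(&height) {
            Some((first_node, first_id)) if first_id.as_str() != block_id => Some(format!(
                "fork at height {}: node {} has {}, node {} has {}",
                height, first_node, first_id, node, block_id
            )),
            Some(_) => None,
            None => {
                self.first_seen.insert(height, (node, block_id.to_string()));
                None
            }
        }
    }
}

/// Tracks when the global max height last advanced.
struct StallDetector {
    threshold: Duration,
    best_height: u32,
    last_advance: Duration,
}

impl StallDetector {
    fn new(threshold: Duration) -> Self {
        Self { threshold, best_height: 0, last_advance: Duration::ZERO }
    }

    /// `heights` holds the latest height polled from each node's database.
    /// Returns true if the global max has not advanced within the threshold.
    fn poll(&mut self, now: Duration, heights: &[u32]) -> bool {
        let max = heights.iter().copied().max().unwrap_or(0);
        if max > self.best_height {
            self.best_height = max;
            self.last_advance = now;
        }
        now.saturating_sub(self.last_advance) > self.threshold
    }
}

fn main() {
    let mut forks = ForkDetector::default();
    forks.observe(0, 10, "0xaa");
    if let Some(report) = forks.observe(1, 10, "0xbb") {
        println!("{}", report);
    }

    let mut stall = StallDetector::new(Duration::from_secs(6));
    stall.poll(Duration::from_secs(1), &[10, 10, 9]);
    let stalled = stall.poll(Duration::from_secs(8), &[10, 10, 10]);
    println!("stalled at t=8s: {}", stalled);
}
```

Polling heights rather than counting stream events means a lagging consumer cannot hide a stall, which matches the reason given above for reading from the database.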

CLI Options

| Flag | Default | Description |
|---|---|---|
| --seed | random | RNG seed for reproducibility |
| --duration | 5m | Total test duration |
| --nodes | 3 | Number of PoA producer nodes |
| --redis-nodes | 3 | Number of Redis instances |
| --block-time | 200ms | Block production interval (Trigger::Interval) |
| --fault-interval | 2s | Average time between fault injections (±50% jitter) |
| --stall-threshold | 6s | Max allowed time with no blocks from any node |
| --log-level | info | Tracing filter (error, warn, info, debug) |
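
As an illustration of what "±50% jitter" on --fault-interval could mean, the sketch below draws each delay uniformly from 0.5x to 1.5x the base interval using a seeded generator. The uniform distribution, the `jittered_delay` function, and the SplitMix64 generator are assumptions chosen to keep the example std-only; the binary may use a different RNG and distribution.

```rust
use std::time::Duration;

/// Tiny deterministic generator (SplitMix64), used here only so the sketch
/// needs no external crates. The same seed always yields the same sequence.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Delay in whole milliseconds, drawn from [base / 2, base / 2 + base].
/// Modulo bias is ignored; it is negligible for intervals this small.
fn jittered_delay(base: Duration, rng: &mut SplitMix64) -> Duration {
    let base_ms = base.as_millis() as u64;
    let low = base_ms / 2;
    let span = base_ms + 1; // number of distinct values in the range
    Duration::from_millis(low + rng.next_u64() % span)
}

fn main() {
    let mut rng = SplitMix64(42);
    for _ in 0..3 {
        println!("{:?}", jittered_delay(Duration::from_secs(2), &mut rng));
    }
}
```

With the default 2s interval, every delay in this sketch falls between 1s and 3s, and rerunning with the same seed reproduces the same schedule.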

Fault Types and Weights

| Category | Weight | Actions |
|---|---|---|
| Network partition | 25% | PartitionNodeFromRedis, PartitionAllFromRedis |
| Latency injection | 20% | AddLatency (50-500ms per direction) |
| Mid-operation drop | 15% | CloseAfterBytes (kill after 10-500 bytes) |
| Redis kill/restart | 15% | KillRedis / RestartRedis |
| Node kill/restart | 15% | KillNode / RestartNode |
| Restore proxy | 5% | RestoreProxy (single) |
| Restore all | 5% | RestoreAllProxies |
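
One common way to turn weights like these into a choice is to walk the cumulative weights with a roll in 0..100. The sketch below shows that technique; the `FaultCategory` enum and `pick` function are hypothetical, and the roll is passed in as a parameter where the scheduler would presumably draw it from the seeded RNG.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum FaultCategory {
    NetworkPartition,
    LatencyInjection,
    MidOperationDrop,
    RedisKillRestart,
    NodeKillRestart,
    RestoreProxy,
    RestoreAll,
}

/// Weights in percent, copied from the table above; they sum to 100.
const WEIGHTS: [(FaultCategory, u32); 7] = [
    (FaultCategory::NetworkPartition, 25),
    (FaultCategory::LatencyInjection, 20),
    (FaultCategory::MidOperationDrop, 15),
    (FaultCategory::RedisKillRestart, 15),
    (FaultCategory::NodeKillRestart, 15),
    (FaultCategory::RestoreProxy, 5),
    (FaultCategory::RestoreAll, 5),
];

/// Maps a roll to a category by walking the cumulative weights.
/// Rolls outside 0..100 are wrapped into range.
fn pick(roll: u32) -> FaultCategory {
    let roll = roll % 100;
    let mut upper = 0;
    for &(category, weight) in WEIGHTS.iter() {
        upper += weight;
        if roll < upper {
            return category;
        }
    }
    unreachable!("weights sum to 100, so every roll below 100 matches")
}

fn main() {
    for roll in [0, 30, 50, 70, 80, 92, 99] {
        println!("{:>2} -> {:?}", roll, pick(roll));
    }
}
```

Keeping the selection a pure function of the roll makes it easy to check that a given seed always produces the same fault sequence.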

Safety constraints: never kill Redis instances below quorum, never kill all nodes, schedule recovery automatically 5-15s after destructive faults, and revert any fault that breaks Redis quorum for all nodes.
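
The first two guards can be expressed as simple predicates. The sketch below assumes that "quorum" means a simple majority of the Redis instances; the function names are hypothetical and the actual checks in fault.rs may differ.

```rust
/// Majority quorum for `total` Redis instances.
/// Assumption: quorum is a simple majority (2 of 3, 3 of 5, ...).
fn quorum(total: usize) -> usize {
    total / 2 + 1
}

/// A Redis kill is allowed only if a quorum is still alive afterwards,
/// i.e. alive - 1 >= quorum(total).
fn can_kill_redis(alive: usize, total: usize) -> bool {
    alive > quorum(total)
}

/// A node kill is allowed only if at least one node stays up.
fn can_kill_node(alive_nodes: usize) -> bool {
    alive_nodes > 1
}

fn main() {
    println!("quorum of 3 Redis instances: {}", quorum(3));
    println!("kill with 3 of 3 alive allowed: {}", can_kill_redis(3, 3));
    println!("kill with 2 of 3 alive allowed: {}", can_kill_redis(2, 3));
    println!("kill last node allowed: {}", can_kill_node(1));
}
```

Under that assumption, the default cluster of 3 Redis instances tolerates exactly one Redis kill at a time.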

Architecture

bin/chaos-test/src/
  main.rs          Orchestration: startup, fault/invariant spawn, settling, report
  cli.rs           clap CLI definition
  cluster.rs       Cluster lifecycle: Redis + proxy grid + PoA nodes (persistent RocksDB)
  proxy.rs         TCP proxy with switchable fault modes (Normal/DropAll/Latency/CloseAfterBytes)
  fault.rs         Weighted RNG fault scheduler with safety guards and auto-recovery
  invariants.rs    Fork detection (stream), gap/stall detection (DB polling)
  timeline.rs      Event log, violation types, final report
  redis_server.rs  Redis process management (spawn/stop/restart)

Mutation Testing

The harness can validate that known bugs are detectable:

```bash
# Example: reintroduce the silent Redis read failure bug,
# then run the fuzzer to confirm it catches the resulting forks
cargo run -p fuel-core-chaos-test -- --seed 42 --duration 60s \
  --fault-interval 500ms --block-time 100ms
# Expected: FAIL with fork violations
```

Known Reproducing Seeds

| Seed | What it catches | Fixed? |
|---|---|---|
| 42 | Fork from silent Redis read failure (mutation test) | Yes (quorum read check) |
| 1337 | Production stall from 1s error sleep + lease retention | Yes (release on error + block_time delay) |
| 3, 200 | Production stall from post-restart ensure_synced delay | Open (P2P sync gate too slow) |

Not Triggered by cargo test

This is a [[bin]] target, not a [[test]]. Running cargo test --all-features will not execute the chaos test.