docs/adr/ADR-062-qemu-swarm-configurator.md
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-03-14 |
| Authors | RuView Team |
| Relates | ADR-061 (QEMU testing platform), ADR-060 (channel/MAC filter), ADR-018 (binary frame), ADR-039 (edge intel) |
| Term | Definition |
|---|---|
| Swarm | A group of N QEMU ESP32-S3 instances running simultaneously |
| Topology | How nodes are connected: star, mesh, line, ring |
| Role | Node function: sensor (collects CSI), coordinator (aggregates + forwards), gateway (bridges to host) |
| Scenario matrix | Cross-product of topology × node count × NVS config × mock scenario |
| Health oracle | Python process that monitors all node UART logs and declares swarm health |
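The scenario matrix defined above is a plain cross-product, so it can be enumerated mechanically. The sketch below is illustrative only: the axis values and the `scenario_matrix` helper are assumptions, since the ADR does not list which values each axis takes.

```python
from itertools import product

# Hypothetical axis values; the ADR does not enumerate them.
TOPOLOGIES = ["star", "mesh", "line", "ring"]
NODE_COUNTS = [2, 3]
NVS_CONFIGS = [{"channel": 6}, {"channel": 11}]
MOCK_SCENARIOS = [0, 2, 3]  # empty room, walking person, fall event


def scenario_matrix(topologies, node_counts, nvs_configs, scenarios):
    """Return one dict per combination of the four axes."""
    return [
        {"topology": t, "node_count": n, "nvs": nvs, "scenario": s}
        for t, n, nvs, s in product(topologies, node_counts, nvs_configs, scenarios)
    ]


matrix = scenario_matrix(TOPOLOGIES, NODE_COUNTS, NVS_CONFIGS, MOCK_SCENARIOS)
print(len(matrix))  # 4 * 2 * 2 * 3 = 48 combinations
```

The matrix grows multiplicatively with each axis, which is one reason a CI run would likely pick a small subset rather than the full product.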
ADR-061 Layer 3 provides a basic multi-node mesh test: N identical nodes with sequential TDM slots connected via a Linux bridge. This is useful but limited.
Real WiFi-DensePose deployments use 3-8 ESP32-S3 nodes in various topologies. A single coordinator aggregates CSI from multiple sensors. The firmware must handle TDM conflicts, missing nodes, role-based behavior differences, and network partitions — none of which ADR-061 Layer 3 tests.
Build a QEMU Swarm Configurator — a YAML-driven tool that defines multi-node test scenarios declaratively and orchestrates them under QEMU with swarm-level validation.
```
┌─────────────────────────────────────────────────────┐
│  swarm_config.yaml                                  │
│  nodes: [{role: sensor, scenario: 2, channel: 6}]   │
│  topology: star                                     │
│  duration: 60s                                      │
│  assertions: [all_nodes_boot, tdm_no_collision, ...]│
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │      qemu_swarm.py      │
          │     (orchestrator)      │
          └───┬────┬────┬───┬───────┘
              │    │    │   │
         ┌────▼┐  ┌▼──┐ ▼  ┌▼─────┐
         │Node0│  │N1 │... │N(n-1)│  QEMU instances
         │sens │  │sen│    │coord │
         └──┬──┘  └─┬─┘    └──┬───┘
            │       │         │
         ┌──▼───────▼─────────▼───┐
         │    Virtual Network     │  TAP bridge / SLIRP
         │   (topology-shaped)    │
         └───────────┬────────────┘
                     │
          ┌──────────▼───────────┐
          │  Aggregator (Rust)   │  Collects frames
          └──────────┬───────────┘
                     │
          ┌──────────▼───────────┐
          │    Health Oracle     │  Swarm-level assertions
          │  (swarm_health.py)   │
          └──────────────────────┘
```
```yaml
# swarm_config.yaml
swarm:
  name: "3-sensor-star"
  duration_s: 60
  topology: star            # star | mesh | line | ring
  aggregator_port: 5005

nodes:
  - role: coordinator
    node_id: 0
    scenario: 0             # empty room (baseline)
    channel: 6
    edge_tier: 2
    is_gateway: true        # receives aggregated frames
  - role: sensor
    node_id: 1
    scenario: 2             # walking person
    channel: 6
    tdm_slot: 1             # TDM slot index (auto-assigned from node position if omitted)
  - role: sensor
    node_id: 2
    scenario: 3             # fall event
    channel: 6
    tdm_slot: 2

assertions:
  - all_nodes_boot
  - no_crashes
  - tdm_no_collision
  - all_nodes_produce_frames
  - coordinator_receives_from_all
  - fall_detected_by_node_2
  - frame_rate_above: 15    # Hz minimum per node
  - max_boot_time_s: 10
```
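The configuration notes that `tdm_slot` is auto-assigned from node position when omitted. A minimal sketch of that rule, plus a static collision check before launch, might look like the following. The helper names `assign_tdm_slots` and `find_slot_collisions` are hypothetical, and the node dicts stand in for what a YAML loader would return for the `nodes` list.

```python
def assign_tdm_slots(nodes):
    """Return copies of the node dicts with tdm_slot filled in from list position when omitted."""
    resolved = []
    for position, node in enumerate(nodes):
        node = dict(node)  # copy so the caller's config is left untouched
        node.setdefault("tdm_slot", position)
        resolved.append(node)
    return resolved


def find_slot_collisions(nodes):
    """Map each TDM slot used by more than one node to the node_ids sharing it."""
    by_slot = {}
    for node in nodes:
        by_slot.setdefault(node["tdm_slot"], []).append(node["node_id"])
    return {slot: ids for slot, ids in by_slot.items() if len(ids) > 1}


# Node list mirroring the example configuration above.
nodes = [
    {"role": "coordinator", "node_id": 0, "scenario": 0, "channel": 6},
    {"role": "sensor", "node_id": 1, "scenario": 2, "channel": 6, "tdm_slot": 1},
    {"role": "sensor", "node_id": 2, "scenario": 3, "channel": 6, "tdm_slot": 2},
]
resolved = assign_tdm_slots(nodes)
collisions = find_slot_collisions(resolved)
```

With this input the coordinator, listed first, ends up in slot 0, which matches the `tdm_slot=0` convention in the roles table. Checking the configuration statically is a complement to the runtime `tdm_no_collision` assertion, not a replacement for it.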
| Topology | Network | Description |
|---|---|---|
| star | All sensors connect to coordinator; coordinator has TAP to each sensor | Hub-and-spoke, most common |
| mesh | All nodes on same bridge (existing Layer 3 behavior) | Every node sees every other |
| line | Node 0 ↔ Node 1 ↔ Node 2 ↔ ... | Linear chain, tests multi-hop |
| ring | Like line but last connects to first | Circular, tests routing |
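The four topologies in the table can be expressed as sets of point-to-point links between node IDs. The sketch below is an assumption about how an orchestrator could derive those links. `topology_links` is a hypothetical helper, and for `mesh` it returns logical pairwise connectivity even though the ADR implements mesh as a single shared bridge.

```python
def topology_links(topology, node_ids, coordinator_id=None):
    """Return the list of (a, b) links implied by a topology name."""
    ids = list(node_ids)
    if topology == "star":
        # Hub defaults to the first node when no coordinator is named.
        hub = ids[0] if coordinator_id is None else coordinator_id
        return [(hub, n) for n in ids if n != hub]
    if topology == "mesh":
        # Every unordered pair of nodes can see each other.
        return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    if topology == "line":
        # Each node links only to its successor.
        return list(zip(ids, ids[1:]))
    if topology == "ring":
        # A line whose last node also links back to the first.
        links = list(zip(ids, ids[1:]))
        if len(ids) > 2:
            links.append((ids[-1], ids[0]))
        return links
    raise ValueError(f"unknown topology: {topology!r}")


print(topology_links("ring", [0, 1, 2, 3]))
```

Each returned pair would correspond to one virtual link (for example a TAP pair) in the star, line, and ring cases.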
| Role | Behavior | NVS Keys |
|---|---|---|
| sensor | Runs mock CSI, sends frames to coordinator | node_id, tdm_slot, target_ip |
| coordinator | Receives frames from sensors, runs edge aggregation | node_id, tdm_slot=0, edge_tier=2 |
| gateway | Like coordinator but also bridges to host UDP | node_id, target_ip=host, is_gateway=1 |
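The role table maps directly onto per-node NVS key sets. The following sketch shows one way to derive them. `nvs_for_role` is a hypothetical helper, the IP addresses are placeholders, and pointing a sensor's `target_ip` at the coordinator is an assumption based on the "sends frames to coordinator" behavior.

```python
def nvs_for_role(node, host_ip, coordinator_ip):
    """Build the NVS key/value dict for one node according to its role."""
    role = node["role"]
    nvs = {"node_id": node["node_id"]}
    if role == "sensor":
        nvs["tdm_slot"] = node["tdm_slot"]
        nvs["target_ip"] = coordinator_ip  # assumption: sensors target the coordinator
    elif role == "coordinator":
        nvs["tdm_slot"] = 0
        nvs["edge_tier"] = 2
    elif role == "gateway":
        nvs["target_ip"] = host_ip
        nvs["is_gateway"] = 1
    else:
        raise ValueError(f"unknown role: {role!r}")
    return nvs


# Placeholder addresses, not taken from the ADR.
sensor_nvs = nvs_for_role(
    {"role": "sensor", "node_id": 1, "tdm_slot": 1},
    host_ip="10.0.0.1",
    coordinator_ip="10.0.0.10",
)
print(sensor_nvs)
```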
| Assertion | What It Checks |
|---|---|
| all_nodes_boot | Every node's UART log shows boot indicators within timeout |
| no_crashes | No Guru Meditation, assert, panic in any log |
| tdm_no_collision | No two nodes transmit in the same TDM slot |
| all_nodes_produce_frames | Every sensor node's log contains CSI frame output |
| coordinator_receives_from_all | Coordinator log shows frames from each sensor's node_id |
| fall_detected_by_node_N | Node N's log reports a fall detection event |
| frame_rate_above | Each node produces at least N frames/second |
| max_boot_time_s | All nodes boot within N seconds |
| no_heap_errors | No OOM or heap corruption in any log |
| network_partitioned_recovery | After deliberate partition, nodes resume communication (future) |
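Two of the assertions above, `no_crashes` and `frame_rate_above`, reduce to simple checks over per-node data. The sketch below assumes the oracle already holds each node's UART log as a string and a per-node frame count. How frames are counted from the log is not specified in the ADR, so it is left out here. The function signatures are illustrative, not the actual `swarm_health.py` API.

```python
import re

# Crash markers named in the assertion table. Matching the bare word "assert"
# is deliberately crude and may need tightening against real firmware logs.
CRASH_PATTERN = re.compile(r"Guru Meditation|assert|panic", re.IGNORECASE)


def no_crashes(logs):
    """Return (node_id, line) pairs for every log line that looks like a crash."""
    failures = []
    for node_id, text in logs.items():
        for line in text.splitlines():
            if CRASH_PATTERN.search(line):
                failures.append((node_id, line))
    return failures


def frame_rate_above(frame_counts, duration_s, min_hz):
    """Return node_ids whose average frame rate over the run is below min_hz."""
    return [n for n, count in frame_counts.items() if count / duration_s < min_hz]


logs = {
    0: "boot ok\nCSI frame 1",
    1: "Guru Meditation Error: Core 0 panic'ed",
}
crashes = no_crashes(logs)
slow_nodes = frame_rate_above({0: 1200, 1: 600}, duration_s=60, min_hz=15)
```

Both functions return the offending nodes rather than a bare boolean, so a failing run can report which node broke the assertion.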
| Preset | Nodes | Topology | Purpose |
|---|---|---|---|
| smoke | 2 | star | Quick CI smoke test (15s) |
| standard | 3 | star | Default 3-node (sensor + sensor + coordinator) |
| large-mesh | 6 | mesh | Scale test with 6 fully-connected nodes |
| line-relay | 4 | line | Multi-hop relay chain |
| ring-fault | 4 | ring | Ring with fault injection mid-test |
| heterogeneous | 5 | star | Mixed scenarios: walk, fall, static, channel-sweep, empty |
| ci-matrix | 3 | star | CI-optimized preset (30s, minimal assertions) |
```
scripts/
├── qemu_swarm.py              # Main orchestrator (CLI entry point)
├── swarm_health.py            # Swarm-level health oracle
└── swarm_presets/
    ├── smoke.yaml
    ├── standard.yaml
    ├── large_mesh.yaml
    ├── line_relay.yaml
    ├── ring_fault.yaml
    ├── heterogeneous.yaml
    └── ci_matrix.yaml

.github/workflows/
└── firmware-qemu.yml          # MODIFIED: add swarm test job
```
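Preset names in the table use hyphens (`large-mesh`), while the files under `scripts/swarm_presets/` use underscores (`large_mesh.yaml`). A small sketch of the name-to-path mapping follows. `preset_path` is a hypothetical helper, and whether the real CLI resolves presets this way is an assumption.

```python
from pathlib import PurePosixPath

# Preset directory as shown in the file layout.
PRESET_DIR = PurePosixPath("scripts/swarm_presets")


def preset_path(name):
    """Map a preset name such as 'large-mesh' to its YAML file path."""
    return PRESET_DIR / (name.replace("-", "_") + ".yaml")


print(preset_path("large-mesh"))  # scripts/swarm_presets/large_mesh.yaml
```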