ADR-072: WiFlow Pose Estimation Architecture

  • Status: Proposed
  • Date: 2026-04-02
  • Deciders: ruv
  • Relates to: ADR-071 (ruvllm Training Pipeline), ADR-070 (Self-Supervised Pretraining), ADR-024 (Contrastive CSI Embedding / AETHER), ADR-069 (Cognitum Seed CSI Pipeline)

Context

The WiFi-DensePose project needs a neural architecture that can convert raw CSI amplitude data into 17-keypoint COCO pose estimates. The existing train-ruvllm.js pipeline uses a simple 2-layer FC encoder (8 -> 64 -> 128) that produces contrastive embeddings for presence detection but cannot output spatial keypoint coordinates.

We evaluated published WiFi-based pose estimation architectures:

| Architecture | Params | Input | Key Innovation | Publication |
|---|---|---|---|---|
| WiFlow | 4.82M | 540x20 | TCN + AsymConv + Axial Attention | arXiv:2602.08661 |
| WiPose | 11.2M | 3x3x30x20 | 3D CNN + heatmap regression | CVPR 2021 |
| MetaFi++ | 8.6M | 114x30x20 | Transformer + meta-learning | NeurIPS 2023 |
| Person-in-WiFi 3D | 15.3M | Multi-antenna | Deformable attention + 3D | CVPR 2024 |

WiFlow is the lightest published SOTA architecture, designed specifically for commercial WiFi hardware. Its key advantage is operating on CSI amplitude only (no phase), which is critical for ESP32-S3 where phase calibration is unreliable.
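To make the amplitude-only point concrete, here is a minimal sketch of amplitude extraction, assuming ESP-IDF's interleaved imaginary/real int8 CSI layout (the byte ordering is an assumption; check the firmware's CSI format). Amplitude survives the per-boot phase offsets that make raw ESP32 phase unreliable:

```javascript
// Sketch: CSI amplitude from interleaved I/Q bytes (assumed [im, re] per
// subcarrier, as in ESP-IDF's CSI buffer). Amplitude = sqrt(re^2 + im^2)
// is stable across boots, unlike raw phase, which needs calibration.
function csiAmplitude(iqBytes) {
  const amps = new Float32Array(iqBytes.length / 2);
  for (let i = 0; i < amps.length; i++) {
    const im = iqBytes[2 * i];
    const re = iqBytes[2 * i + 1];
    amps[i] = Math.sqrt(re * re + im * im);
  }
  return amps;
}
```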

Why WiFlow

  1. Lightest SOTA: 4.82M parameters at original scale; our adaptation targets ~1.8M
  2. Amplitude-only: Discards phase, which is noisy on consumer hardware
  3. Published architecture: Fully specified in arXiv:2602.08661, reproducible
  4. Temporal modeling: TCN with dilated causal convolutions captures motion dynamics
  5. Efficient attention: Axial attention reduces O(H^2W^2) to O(H^2W + HW^2)
  6. Proven on commercial WiFi: Validated on commodity Intel 5300 and Atheros hardware
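The axial-attention saving in point 5 is easy to verify with the concrete post-encoder grid used later in this document (H = 8 feature rows, W = 20 time steps). This is illustrative arithmetic, not library code:

```javascript
// Query-key comparisons for an HxW feature map:
// full 2D self-attention compares every position against every position,
// axial attention attends along each axis separately.
function fullAttnPairs(H, W) {
  return (H * W) * (H * W);       // O(H^2 W^2)
}
function axialAttnPairs(H, W) {
  return H * W * H + H * W * W;   // O(H^2 W + H W^2)
}

console.log(fullAttnPairs(8, 20));  // 25600 pairs
console.log(axialAttnPairs(8, 20)); // 4480 pairs, ~5.7x fewer
```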

Decision

Implement the WiFlow architecture in pure JavaScript (ruvllm native) with the following adaptations for our ESP32 single TX/RX deployment.

Architecture Overview

```
CSI Amplitude [128, 20]
        |
   Stage 1: TCN (Dilated Causal Conv)
   dilation = (1, 2, 4, 8), kernel = 7
   128 -> 256 -> 192 -> 128 channels
        |
   Stage 2: Asymmetric Conv Encoder
   1xk conv (k=3), stride (1,2)
   [1, 128, 20] -> [256, 8, 20]
        |
   Stage 3: Axial Self-Attention
   Width (temporal): 8 heads
   Height (feature): 8 heads
        |
   Decoder: Adaptive Avg Pool + Linear
   [256, 8, 20] -> pool -> [2048] -> [17, 2]
        |
   17 COCO Keypoints [x, y] in [0, 1]
```
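A quick check that the Stage 1 dilation schedule is sufficient: the receptive field of stacked dilated causal convolutions grows by (kernel - 1) * dilation per block, so with kernel 7 and dilations (1, 2, 4, 8) every output step can condition on the full 20-step input window:

```javascript
// Receptive field of stacked dilated causal convolutions:
// each block with kernel k and dilation d adds (k - 1) * d past steps.
function tcnReceptiveField(kernel, dilations) {
  return 1 + dilations.reduce((rf, d) => rf + (kernel - 1) * d, 0);
}

// Stage 1 config from the diagram: kernel = 7, dilations (1, 2, 4, 8).
console.log(tcnReceptiveField(7, [1, 2, 4, 8])); // 91 steps, well beyond the 20-step window
```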

Our Adaptation vs Original WiFlow

| Aspect | WiFlow Original | Our Adaptation | Reason |
|---|---|---|---|
| Input channels | 540 (18 links x 30 SC) | 128 (1 TX x 1 RX x 128 SC) | Single ESP32 link |
| Time steps | 20 | 20 | Same |
| TCN channels | 540 -> 256 -> 128 -> 64 | 128 -> 256 -> 192 -> 128 | Proportional reduction |
| Spatial blocks | 4 (stride 2) | 4 (stride 2) | Same |
| Attention heads | 8 | 8 | Same |
| Parameters | 4.82M | ~1.8M | Fewer input channels |
| Input type | Amplitude only | Amplitude only | Same |
| Output | 17 x 2 | 17 x 2 | Same |

Parameter Budget Breakdown

| Stage | Parameters | % of Total |
|---|---|---|
| TCN (4 blocks, k=7, d=1,2,4,8) | ~969K | 54% |
| Asymmetric Conv (4 blocks, 1x3, stride 2) | ~174K | 10% |
| Axial Attention (width + height, 8 heads) | ~592K | 33% |
| Pose Decoder (pool + linear -> 17x2) | ~70K | 4% |
| Total | ~1.8M | 100% |

Loss Function

```
L = L_H + 0.2 * L_B

L_H = SmoothL1(predicted, target, beta = 0.1)
L_B = (1/14) * sum_b (bone_length_b - prior_b)^2
```

14 bone connections enforce anatomical constraints:

  • Nose-eye (x2): 0.06
  • Eye-ear (x2): 0.06
  • Shoulder-elbow (x2): 0.15
  • Elbow-wrist (x2): 0.13
  • Shoulder-hip (x2): 0.26
  • Hip-knee (x2): 0.25
  • Knee-ankle (x2): 0.25
  • Shoulder width: 0.20

All lengths normalized to person height.
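The loss above can be sketched directly in plain JavaScript. This is an illustrative subset, not the shipped implementation: `BONES`/`PRIORS` cover only the four arm connections from the list (assuming COCO indices 5/6 = shoulders, 7/8 = elbows, 9/10 = wrists), and keypoints are `[17][2]` arrays in [0, 1]:

```javascript
// Elementwise SmoothL1 (Huber-style) over 17 [x, y] keypoints, beta = 0.1.
function smoothL1(pred, target, beta = 0.1) {
  let sum = 0, n = 0;
  for (let k = 0; k < pred.length; k++) {
    for (let c = 0; c < 2; c++) {
      const d = Math.abs(pred[k][c] - target[k][c]);
      sum += d < beta ? 0.5 * d * d / beta : d - 0.5 * beta;
      n++;
    }
  }
  return sum / n;
}

// Illustrative subset of the 14 bone connections and their length priors.
const BONES = [[5, 7], [7, 9], [6, 8], [8, 10]]; // L/R shoulder-elbow, elbow-wrist
const PRIORS = [0.15, 0.13, 0.15, 0.13];         // normalized lengths from the list above

// Mean squared deviation of predicted bone lengths from the priors.
function boneLoss(pred, bones = BONES, priors = PRIORS) {
  let sum = 0;
  bones.forEach(([a, b], i) => {
    const len = Math.hypot(pred[a][0] - pred[b][0], pred[a][1] - pred[b][1]);
    sum += (len - priors[i]) ** 2;
  });
  return sum / bones.length;
}

// L = L_H + 0.2 * L_B
const poseLoss = (pred, target) => smoothL1(pred, target) + 0.2 * boneLoss(pred);
```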

Training Strategy (Camera-Free Pipeline)

Since we have no ground-truth pose labels from cameras, training proceeds in three phases:

Phase 1: Contrastive Pretraining

  • Temporal triplets: adjacent windows are positive pairs, distant windows are negative
  • Cross-node triplets: same-time windows from different ESP32 nodes are positive
  • Uses ruvllm ContrastiveTrainer with triplet + InfoNCE loss
  • Learns a representation where similar CSI states cluster together
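The temporal-triplet objective can be sketched as follows. This is illustrative only; the actual pipeline uses ruvllm's ContrastiveTrainer and tripletLoss, and the margin value here is an assumed hyperparameter:

```javascript
// Euclidean distance between two embedding vectors.
function l2(a, b) {
  return Math.sqrt(a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0));
}

// Triplet loss: pull the adjacent-window (positive) embedding closer to the
// anchor than the distant-window (negative) embedding, by at least `margin`.
function tripletLoss(anchor, positive, negative, margin = 0.5) {
  return Math.max(0, l2(anchor, positive) - l2(anchor, negative) + margin);
}
```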

Phase 2: Pose Proxy Training

  • Generate coarse pose proxies from vitals data:
    • Person detected (presence > 0.3): place standing skeleton at center
    • High motion: perturb limb positions proportional to motion energy
    • Breathing: add micro-oscillation to torso keypoints
  • Train with SmoothL1 + bone constraint loss
  • Confidence-weighted updates (higher presence = stronger gradient)
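The proxy generation above can be sketched as follows. Everything here is illustrative: `STANDING` is an assumed normalized template (17 COCO keypoints), the 0.05 perturbation scale is an assumed constant, and the breathing micro-oscillation is omitted for brevity:

```javascript
// Assumed canonical standing skeleton: 17 COCO keypoints, [x, y] in [0, 1].
const STANDING = [
  [0.50, 0.10],                 // nose
  [0.48, 0.08], [0.52, 0.08],   // eyes
  [0.46, 0.09], [0.54, 0.09],   // ears
  [0.40, 0.25], [0.60, 0.25],   // shoulders
  [0.38, 0.40], [0.62, 0.40],   // elbows
  [0.36, 0.53], [0.64, 0.53],   // wrists
  [0.43, 0.51], [0.57, 0.51],   // hips
  [0.43, 0.76], [0.57, 0.76],   // knees
  [0.43, 1.00], [0.57, 1.00],   // ankles
];

// Coarse pose proxy from vitals: standing skeleton, perturbed by motion energy.
function poseProxy(presence, motionEnergy, rng = Math.random) {
  if (presence <= 0.3) return null;      // no proxy when nobody is detected
  const jitter = 0.05 * motionEnergy;    // assumed perturbation scale
  return STANDING.map(([x, y], k) => {
    const isLimb = k >= 7;               // elbows and below move the most
    const j = isLimb ? jitter : jitter * 0.2;
    return [x + (rng() - 0.5) * j, y + (rng() - 0.5) * j];
  });
}
```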

Phase 3: Self-Refinement (Future)

  • Multi-node consistency: same person seen from different nodes should produce consistent pose after geometric transform
  • Temporal smoothness: adjacent frames should produce similar poses
  • Bone constraint tightening: gradually reduce tolerance

Integration with Existing Pipeline

```
train-ruvllm.js (ADR-071)        train-wiflow.js (ADR-072)
  |                                  |
  | 8-dim features                   | 128-dim raw CSI amplitude
  | -> 128-dim embedding             | -> 17x2 keypoint coordinates
  | -> presence/activity/vitals      | -> bone-constrained pose
  |                                  |
  +-- ContrastiveTrainer -----+------+
  +-- TrainingPipeline -------+------+
  +-- LoRA per-node ----------+------+
  +-- TurboQuant quantize ----+------+
  +-- SafeTensors export -----+------+
```

Both pipelines share the ruvllm infrastructure; WiFlow adds the deeper architecture for direct pose regression while the simple encoder handles embedding tasks.

Performance Targets

| Metric | Target | Notes |
|---|---|---|
| PCK@20 | > 80% | On lab data with 2+ nodes |
| Forward latency | < 50ms | Pi Zero 2W at INT8 |
| Model size (INT8) | < 2 MB | TurboQuant |
| Bone violation rate | < 10% | 50% tolerance |
| Temporal jitter | < 3cm | Exponential smoothing |
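The exponential smoothing behind the temporal-jitter target can be sketched as a per-keypoint EMA over successive frames (`alpha` here is an assumed smoothing factor, not a tuned value):

```javascript
// Exponential moving average over pose frames: blends each new [x, y]
// keypoint with the previous smoothed frame to suppress jitter.
function smoothPose(prev, current, alpha = 0.6) {
  if (!prev) return current;   // first frame: nothing to blend with
  return current.map(([x, y], k) => [
    alpha * x + (1 - alpha) * prev[k][0],
    alpha * y + (1 - alpha) * prev[k][1],
  ]);
}
```

A higher `alpha` tracks fast motion more closely; a lower one smooths harder at the cost of lag.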

Risk Assessment

| Risk | Severity | Mitigation |
|---|---|---|
| Single TX/RX has less spatial info than 18 links | High | 2-node multi-static compensates; cross-node fusion from ADR-029 |
| Camera-free labels are coarse | Medium | Bone constraints enforce anatomy; contrastive pretraining provides structure |
| Pure JS too slow for real-time | Medium | INT8 quantization; axial attention is O(H^2W + HW^2), not O(H^2W^2) |
| Overfitting with ~5K frames | Medium | Temporal augmentation + noise + cross-node interpolation |
| Phase not available (amplitude-only) | Low | WiFlow was designed amplitude-only; not a limitation |

Consequences

Positive

  • Proven SOTA architecture adapted to our hardware constraints
  • Pure JavaScript implementation runs everywhere ruvllm runs (Node.js, browser WASM)
  • Bone constraints enforce physically plausible outputs even with noisy inputs
  • Shares training infrastructure with existing ruvllm pipeline
  • Modular: each stage (TCN, AsymConv, Axial, Decoder) is independently testable

Negative

  • ~1.8M parameters is 193x larger than simple CsiEncoder (9,344 params)
  • Forward pass is slower (~50ms vs <1ms for simple encoder)
  • Camera-free training will produce lower accuracy than supervised WiFlow
  • No ground-truth PCK evaluation possible without camera labels
  • Axial attention is O(N^2) within each axis, limiting scalability

Neutral

  • FLOPs dominated by TCN (~48%) due to dilated convolutions
  • INT8 quantization brings model to ~1.7MB, viable for edge deployment
  • Architecture is fixed (no NAS); future work could explore lighter variants

Implementation

Files Created

| File | Purpose |
|---|---|
| scripts/wiflow-model.js | WiFlow architecture (all stages, loss, metrics) |
| scripts/train-wiflow.js | Training pipeline (contrastive + pose proxy + LoRA + quant) |
| scripts/benchmark-wiflow.js | Benchmarking (latency, params, FLOPs, memory, quality) |
| docs/adr/ADR-072-wiflow-architecture.md | This document |

Usage

```bash
# Train on collected data
node scripts/train-wiflow.js --data data/recordings/pretrain-*.csi.jsonl

# Train with more epochs and custom output
node scripts/train-wiflow.js --data data/recordings/*.csi.jsonl --epochs 50 --output models/wiflow-v2

# Contrastive pretraining only (no labels needed)
node scripts/train-wiflow.js --data data/recordings/*.csi.jsonl --contrastive-only

# Benchmark
node scripts/benchmark-wiflow.js

# Benchmark with trained model
node scripts/benchmark-wiflow.js --model models/wiflow-v1
```

Dependencies

  • ruvllm (vendored at vendor/ruvector/npm/packages/ruvllm/src/)
    • ContrastiveTrainer, tripletLoss, infoNCELoss, computeGradient
    • TrainingPipeline
    • LoraAdapter, LoraManager
    • EwcManager
    • ModelExporter, SafeTensorsWriter
  • No external ML frameworks (no PyTorch, no TensorFlow, no ONNX Runtime)

References

  • WiFlow: arXiv:2602.08661
  • COCO Keypoints: https://cocodataset.org/#keypoints-2020
  • Axial Attention: Wang et al., "Axial-DeepLab", ECCV 2020
  • TCN: Bai et al., "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling", 2018