ADR-072: WiFlow Pose Estimation Architecture

  • Status: Proposed
  • Date: 2026-04-02
  • Deciders: ruv
  • Relates to: ADR-071 (ruvllm Training Pipeline), ADR-070 (Self-Supervised Pretraining), ADR-024 (Contrastive CSI Embedding / AETHER), ADR-069 (Cognitum Seed CSI Pipeline)

Context

The WiFi-DensePose project needs a neural architecture that can convert raw CSI amplitude data into 17-keypoint COCO pose estimates. The existing train-ruvllm.js pipeline uses a simple 2-layer FC encoder (8 -> 64 -> 128) that produces contrastive embeddings for presence detection but cannot output spatial keypoint coordinates.

We evaluated published WiFi-based pose estimation architectures:

| Architecture | Params | Input | Key Innovation | Publication |
|---|---|---|---|---|
| WiFlow | 4.82M | 540x20 | TCN + AsymConv + Axial Attention | arXiv:2602.08661 |
| WiPose | 11.2M | 3x3x30x20 | 3D CNN + heatmap regression | CVPR 2021 |
| MetaFi++ | 8.6M | 114x30x20 | Transformer + meta-learning | NeurIPS 2023 |
| Person-in-WiFi 3D | 15.3M | Multi-antenna | Deformable attention + 3D | CVPR 2024 |

WiFlow is the lightest published SOTA architecture, designed specifically for commercial WiFi hardware. Its key advantage is operating on CSI amplitude only (no phase), which is critical for ESP32-S3 where phase calibration is unreliable.
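To make the amplitude-only point concrete, here is a minimal sketch of amplitude extraction, assuming ESP-IDF's interleaved imaginary/real int8 CSI layout (the byte ordering is an assumption; check the firmware's CSI format). Amplitude survives the per-boot phase offsets that make raw ESP32 phase unreliable:

```javascript
// Sketch: CSI amplitude from interleaved I/Q bytes (assumed [im, re] per
// subcarrier, as in ESP-IDF's CSI buffer). Amplitude = sqrt(re^2 + im^2)
// is stable across boots, unlike raw phase, which needs calibration.
function csiAmplitude(iqBytes) {
  const amps = new Float32Array(iqBytes.length / 2);
  for (let i = 0; i < amps.length; i++) {
    const im = iqBytes[2 * i];
    const re = iqBytes[2 * i + 1];
    amps[i] = Math.sqrt(re * re + im * im);
  }
  return amps;
}
```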

Why WiFlow

  1. Lightest SOTA: 4.82M parameters at original scale; our adaptation targets ~1.8M
  2. Amplitude-only: Discards phase, which is noisy on consumer hardware
  3. Published architecture: Fully specified in arXiv:2602.08661, reproducible
  4. Temporal modeling: TCN with dilated causal convolutions captures motion dynamics
  5. Efficient attention: Axial attention reduces O(H^2W^2) to O(H^2W + HW^2)
  6. Proven on commercial WiFi: Validated on commodity Intel 5300 and Atheros hardware
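The axial-attention saving in point 5 is easy to verify with the concrete post-encoder grid used later in this document (H = 8 feature rows, W = 20 time steps). This is illustrative arithmetic, not library code:

```javascript
// Query-key comparisons for an HxW feature map:
// full 2D self-attention compares every position against every position,
// axial attention attends along each axis separately.
function fullAttnPairs(H, W) {
  return (H * W) * (H * W);       // O(H^2 W^2)
}
function axialAttnPairs(H, W) {
  return H * W * H + H * W * W;   // O(H^2 W + H W^2)
}

console.log(fullAttnPairs(8, 20));  // 25600 pairs
console.log(axialAttnPairs(8, 20)); // 4480 pairs, ~5.7x fewer
```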

Decision

Implement the WiFlow architecture in pure JavaScript (ruvllm native) with the following adaptations for our ESP32 single TX/RX deployment.

Architecture Overview

```
CSI Amplitude [128, 20]
        |
   Stage 1: TCN (Dilated Causal Conv)
   dilation = (1, 2, 4, 8), kernel = 7
   128 -> 256 -> 192 -> 128 channels
        |
   Stage 2: Asymmetric Conv Encoder
   1xk conv (k=3), stride (1,2)
   [1, 128, 20] -> [256, 8, 20]
        |
   Stage 3: Axial Self-Attention
   Width (temporal): 8 heads
   Height (feature): 8 heads
        |
   Decoder: Adaptive Avg Pool + Linear
   [256, 8, 20] -> pool -> [2048] -> [17, 2]
        |
   17 COCO Keypoints [x, y] in [0, 1]
```
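A quick check that the Stage 1 dilation schedule is sufficient: the receptive field of stacked dilated causal convolutions grows by (kernel - 1) * dilation per block, so with kernel 7 and dilations (1, 2, 4, 8) every output step can condition on the full 20-step input window:

```javascript
// Receptive field of stacked dilated causal convolutions:
// each block with kernel k and dilation d adds (k - 1) * d past steps.
function tcnReceptiveField(kernel, dilations) {
  return 1 + dilations.reduce((rf, d) => rf + (kernel - 1) * d, 0);
}

// Stage 1 config from the diagram: kernel = 7, dilations (1, 2, 4, 8).
console.log(tcnReceptiveField(7, [1, 2, 4, 8])); // 91 steps, well beyond the 20-step window
```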

Our Adaptation vs Original WiFlow

| Aspect | WiFlow Original | Our Adaptation | Reason |
|---|---|---|---|
| Input channels | 540 (18 links x 30 SC) | 128 (1 TX x 1 RX x 128 SC) | Single ESP32 link |
| Time steps | 20 | 20 | Same |
| TCN channels | 540 -> 256 -> 128 -> 64 | 128 -> 256 -> 192 -> 128 | Proportional reduction |
| Spatial blocks | 4 (stride 2) | 4 (stride 2) | Same |
| Attention heads | 8 | 8 | Same |
| Parameters | 4.82M | ~1.8M | Fewer input channels |
| Input type | Amplitude only | Amplitude only | Same |
| Output | 17 x 2 | 17 x 2 | Same |

Parameter Budget Breakdown

| Stage | Parameters | % of Total |
|---|---|---|
| TCN (4 blocks, k=7, d=1,2,4,8) | ~969K | 54% |
| Asymmetric Conv (4 blocks, 1x3, stride 2) | ~174K | 10% |
| Axial Attention (width + height, 8 heads) | ~592K | 33% |
| Pose Decoder (pool + linear -> 17x2) | ~70K | 4% |
| Total | ~1.8M | 100% |

Loss Function

```
L = L_H + 0.2 * L_B

L_H = SmoothL1(predicted, target, beta = 0.1)
L_B = (1/14) * sum_b (bone_length_b - prior_b)^2
```

14 bone connections enforce anatomical constraints:

  • Nose-eye (x2): 0.06
  • Eye-ear (x2): 0.06
  • Shoulder-elbow (x2): 0.15
  • Elbow-wrist (x2): 0.13
  • Shoulder-hip (x2): 0.26
  • Hip-knee (x2): 0.25
  • Knee-ankle (x2): 0.25
  • Shoulder width: 0.20

All lengths normalized to person height.
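The loss above can be sketched directly in plain JavaScript. This is an illustrative subset, not the shipped implementation: `BONES`/`PRIORS` cover only the four arm connections from the list (assuming COCO indices 5/6 = shoulders, 7/8 = elbows, 9/10 = wrists), and keypoints are `[17][2]` arrays in [0, 1]:

```javascript
// Elementwise SmoothL1 (Huber-style) over 17 [x, y] keypoints, beta = 0.1.
function smoothL1(pred, target, beta = 0.1) {
  let sum = 0, n = 0;
  for (let k = 0; k < pred.length; k++) {
    for (let c = 0; c < 2; c++) {
      const d = Math.abs(pred[k][c] - target[k][c]);
      sum += d < beta ? 0.5 * d * d / beta : d - 0.5 * beta;
      n++;
    }
  }
  return sum / n;
}

// Illustrative subset of the 14 bone connections and their length priors.
const BONES = [[5, 7], [7, 9], [6, 8], [8, 10]]; // L/R shoulder-elbow, elbow-wrist
const PRIORS = [0.15, 0.13, 0.15, 0.13];         // normalized lengths from the list above

// Mean squared deviation of predicted bone lengths from the priors.
function boneLoss(pred, bones = BONES, priors = PRIORS) {
  let sum = 0;
  bones.forEach(([a, b], i) => {
    const len = Math.hypot(pred[a][0] - pred[b][0], pred[a][1] - pred[b][1]);
    sum += (len - priors[i]) ** 2;
  });
  return sum / bones.length;
}

// L = L_H + 0.2 * L_B
const poseLoss = (pred, target) => smoothL1(pred, target) + 0.2 * boneLoss(pred);
```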

Training Strategy (Camera-Free Pipeline)

Since we have no ground-truth pose labels from cameras, training proceeds in three phases:

Phase 1: Contrastive Pretraining

  • Temporal triplets: adjacent windows are positive pairs, distant windows are negative
  • Cross-node triplets: same-time windows from different ESP32 nodes are positive
  • Uses ruvllm ContrastiveTrainer with triplet + InfoNCE loss
  • Learns a representation where similar CSI states cluster together
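The temporal-triplet objective can be sketched as follows. This is illustrative only; the actual pipeline uses ruvllm's ContrastiveTrainer and tripletLoss, and the margin value here is an assumed hyperparameter:

```javascript
// Euclidean distance between two embedding vectors.
function l2(a, b) {
  return Math.sqrt(a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0));
}

// Triplet loss: pull the adjacent-window (positive) embedding closer to the
// anchor than the distant-window (negative) embedding, by at least `margin`.
function tripletLoss(anchor, positive, negative, margin = 0.5) {
  return Math.max(0, l2(anchor, positive) - l2(anchor, negative) + margin);
}
```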

Phase 2: Pose Proxy Training

  • Generate coarse pose proxies from vitals data:
    • Person detected (presence > 0.3): place standing skeleton at center
    • High motion: perturb limb positions proportional to motion energy
    • Breathing: add micro-oscillation to torso keypoints
  • Train with SmoothL1 + bone constraint loss
  • Confidence-weighted updates (higher presence = stronger gradient)
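The proxy generation above can be sketched as follows. Everything here is illustrative: `STANDING` is an assumed normalized template (17 COCO keypoints), the 0.05 perturbation scale is an assumed constant, and the breathing micro-oscillation is omitted for brevity:

```javascript
// Assumed canonical standing skeleton: 17 COCO keypoints, [x, y] in [0, 1].
const STANDING = [
  [0.50, 0.10],                 // nose
  [0.48, 0.08], [0.52, 0.08],   // eyes
  [0.46, 0.09], [0.54, 0.09],   // ears
  [0.40, 0.25], [0.60, 0.25],   // shoulders
  [0.38, 0.40], [0.62, 0.40],   // elbows
  [0.36, 0.53], [0.64, 0.53],   // wrists
  [0.43, 0.51], [0.57, 0.51],   // hips
  [0.43, 0.76], [0.57, 0.76],   // knees
  [0.43, 1.00], [0.57, 1.00],   // ankles
];

// Coarse pose proxy from vitals: standing skeleton, perturbed by motion energy.
function poseProxy(presence, motionEnergy, rng = Math.random) {
  if (presence <= 0.3) return null;      // no proxy when nobody is detected
  const jitter = 0.05 * motionEnergy;    // assumed perturbation scale
  return STANDING.map(([x, y], k) => {
    const isLimb = k >= 7;               // elbows and below move the most
    const j = isLimb ? jitter : jitter * 0.2;
    return [x + (rng() - 0.5) * j, y + (rng() - 0.5) * j];
  });
}
```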

Phase 3: Self-Refinement (Future)

  • Multi-node consistency: same person seen from different nodes should produce consistent pose after geometric transform
  • Temporal smoothness: adjacent frames should produce similar poses
  • Bone constraint tightening: gradually reduce tolerance

Integration with Existing Pipeline

```
train-ruvllm.js (ADR-071)        train-wiflow.js (ADR-072)
  |                                  |
  | 8-dim features                   | 128-dim raw CSI amplitude
  | -> 128-dim embedding             | -> 17x2 keypoint coordinates
  | -> presence/activity/vitals      | -> bone-constrained pose
  |                                  |
  +-- ContrastiveTrainer -----+------+
  +-- TrainingPipeline -------+------+
  +-- LoRA per-node ----------+------+
  +-- TurboQuant quantize ----+------+
  +-- SafeTensors export -----+------+
```

Both pipelines share the ruvllm infrastructure; WiFlow adds the deeper architecture for direct pose regression while the simple encoder handles embedding tasks.

Performance Targets

| Metric | Target | Notes |
|---|---|---|
| PCK@20 | > 80% | On lab data with 2+ nodes |
| Forward latency | < 50ms | Pi Zero 2W at INT8 |
| Model size (INT8) | < 2 MB | TurboQuant |
| Bone violation rate | < 10% | 50% tolerance |
| Temporal jitter | < 3cm | Exponential smoothing |
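The exponential smoothing behind the temporal-jitter target can be sketched as a per-keypoint EMA over successive frames (`alpha` here is an assumed smoothing factor, not a tuned value):

```javascript
// Exponential moving average over pose frames: blends each new [x, y]
// keypoint with the previous smoothed frame to suppress jitter.
function smoothPose(prev, current, alpha = 0.6) {
  if (!prev) return current;   // first frame: nothing to blend with
  return current.map(([x, y], k) => [
    alpha * x + (1 - alpha) * prev[k][0],
    alpha * y + (1 - alpha) * prev[k][1],
  ]);
}
```

A higher `alpha` tracks fast motion more closely; a lower one smooths harder at the cost of lag.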

Risk Assessment

| Risk | Severity | Mitigation |
|---|---|---|
| Single TX/RX has less spatial info than 18 links | High | 2-node multi-static compensates; cross-node fusion from ADR-029 |
| Camera-free labels are coarse | Medium | Bone constraints enforce anatomy; contrastive pretraining provides structure |
| Pure JS too slow for real-time | Medium | INT8 quantization; axial attention is O(H^2W + HW^2), not O(H^2W^2) |
| Overfitting with ~5K frames | Medium | Temporal augmentation + noise + cross-node interpolation |
| Phase not available (amplitude-only) | Low | WiFlow was designed amplitude-only; not a limitation |

Consequences

Positive

  • Proven SOTA architecture adapted to our hardware constraints
  • Pure JavaScript implementation runs everywhere ruvllm runs (Node.js, browser WASM)
  • Bone constraints enforce physically plausible outputs even with noisy inputs
  • Shares training infrastructure with existing ruvllm pipeline
  • Modular: each stage (TCN, AsymConv, Axial, Decoder) is independently testable

Negative

  • ~1.8M parameters is 193x larger than simple CsiEncoder (9,344 params)
  • Forward pass is slower (~50ms vs <1ms for simple encoder)
  • Camera-free training will produce lower accuracy than supervised WiFlow
  • No ground-truth PCK evaluation possible without camera labels
  • Axial attention is O(N^2) within each axis, limiting scalability

Neutral

  • FLOPs dominated by TCN (~48%) due to dilated convolutions
  • INT8 quantization brings model to ~1.7MB, viable for edge deployment
  • Architecture is fixed (no NAS); future work could explore lighter variants

Implementation

Files Created

| File | Purpose |
|---|---|
| scripts/wiflow-model.js | WiFlow architecture (all stages, loss, metrics) |
| scripts/train-wiflow.js | Training pipeline (contrastive + pose proxy + LoRA + quant) |
| scripts/benchmark-wiflow.js | Benchmarking (latency, params, FLOPs, memory, quality) |
| docs/adr/ADR-072-wiflow-architecture.md | This document |

Usage

```bash
# Train on collected data
node scripts/train-wiflow.js --data data/recordings/pretrain-*.csi.jsonl

# Train with more epochs and custom output
node scripts/train-wiflow.js --data data/recordings/*.csi.jsonl --epochs 50 --output models/wiflow-v2

# Contrastive pretraining only (no labels needed)
node scripts/train-wiflow.js --data data/recordings/*.csi.jsonl --contrastive-only

# Benchmark
node scripts/benchmark-wiflow.js

# Benchmark with trained model
node scripts/benchmark-wiflow.js --model models/wiflow-v1
```

Dependencies

  • ruvllm (vendored at vendor/ruvector/npm/packages/ruvllm/src/)
    • ContrastiveTrainer, tripletLoss, infoNCELoss, computeGradient
    • TrainingPipeline
    • LoraAdapter, LoraManager
    • EwcManager
    • ModelExporter, SafeTensorsWriter
  • No external ML frameworks (no PyTorch, no TensorFlow, no ONNX Runtime)

References

  • WiFlow: arXiv:2602.08661
  • COCO Keypoints: https://cocodataset.org/#keypoints-2020
  • Axial Attention: Wang et al., "Axial-DeepLab", ECCV 2020
  • TCN: Bai et al., "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling", 2018